# Misleading models: “How learning can guide evolution”

I often see examples of mathematicians, physicists, or computer scientists transitioning into other scientific disciplines and going on to great success. However, the converse is rare, and the only two examples I know are Edward Witten’s transition from an undergrad in history and linguistics to a ground-breaking career in theoretical physics, and Geoffrey Hinton‘s transition from an undergrad in experimental psychology to a trend-setting career in artificial intelligence. Although in my mind Hinton is associated with neural networks and deep learning, that isn’t his only contribution to fields close to my heart. As is becoming pleasantly common on TheEGG, this is a connection I would have missed if it weren’t for Graham Jones‘ insightful comment and subsequent email discussion in early October.

The reason I raise the topic four months later is that the connection continues our exploration of learning and evolution. In particular, Hinton & Nowlan (1987) were the first to show the Baldwin effect in action. They showed how learning can speed up evolution in a model that combined a genetic algorithm with learning by trial and error. Although the model was influential, I fear that it is misleading and that the strength of its results is often misinterpreted. As such, I want to explore these shortcomings and spell out what would be a convincing demonstration of a qualitative increase in adaptability due to learning.

Hinton & Nowlan (1987) consider the following artificial model:

• Agents are strings of length 20 over the alphabet {0,1,?}.
• There is a single ideal target phenotype x* that has 20 times the fitness of any other phenotype. To achieve this 20-fold boost, every single bit of the agent’s phenotype must agree with the ideal one; even a single difference results in the minimal fitness. Thus, the landscape is completely flat except for one spike at x*.
• If a letter in an agent’s string is 0 or 1 then it is locked in, and not subject to learning. If a letter is ? then it is modifiable by learning. The agent learns by trial and error: in each trial, it sets all its ?s to random values and observes the resulting payoff. If the reward is the minimum, the agent continues exploring; if the reward is the maximum, then in all future trials the agent uses this setting. There are 1000 trials of learning for each agent, and the agent’s fitness is cumulative over all trials. Note that this means that if an agent has some of its locked-in bits set incorrectly, then that agent will never learn the high-fitness phenotype.
• None of the learning is passed on to children, so a child inherits only the locked-in 0s and 1s, but continues to have ?s in all the same places as its parent. Reproduction is proportional to fitness and sexual: every child is a recombination of its parents’ strings, taking the value of each letter randomly from one of the parents.
• The authors proceed by simulation, seeding it with 1000 individuals, with each letter being a ? with probability 0.5, and a 0 or 1 with probability 0.25 each.
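These dynamics are easy to reproduce. Below is a minimal Python sketch of the model; it is my own reconstruction, not the authors’ code. It uses a cumulative-fitness rule in the spirit of the paper (1 + 19 × remaining trials / 1000 once the peak is found) and replaces the explicit trial-by-trial learning loop with an equivalent geometric sample of the first successful trial:

```python
import math
import random

N = 20          # genome length
POP = 1000      # population size
TRIALS = 1000   # learning trials per agent per generation

TARGET = (1,) * N   # the single high-fitness phenotype x* (choice is arbitrary)

def random_agent():
    # seed distribution: '?' with probability 0.5, 0 or 1 with 0.25 each
    return tuple(random.choice(('?', '?', 0, 1)) for _ in range(N))

def fitness(agent):
    """Cumulative payoff over TRIALS random guesses at the '?' positions:
    1 + 19 * (trials remaining after the first success) / TRIALS."""
    # a wrongly locked-in bit makes the target unreachable by learning
    if any(a != '?' and a != t for a, t in zip(agent, TARGET)):
        return 1.0
    q = sum(1 for a in agent if a == '?')
    p = 0.5 ** q                       # chance that one random guess hits x*
    if p == 1.0:
        first = 1
    else:
        # sample the trial of first success geometrically instead of looping
        u = 1.0 - random.random()      # uniform in (0, 1]
        first = 1 + int(math.log(u) / math.log(1.0 - p))
    if first > TRIALS:
        return 1.0
    return 1.0 + 19.0 * (TRIALS - first) / TRIALS

def next_generation(pop):
    fits = [fitness(a) for a in pop]
    # fitness-proportional (sexual) reproduction
    parents = random.choices(pop, weights=fits, k=2 * POP)
    # each letter of the child comes from a randomly chosen parent
    return [tuple(random.choice(pair) for pair in zip(p1, p2))
            for p1, p2 in zip(parents[::2], parents[1::2])]

random.seed(0)
pop = [random_agent() for _ in range(POP)]
for generation in range(20):
    pop = next_generation(pop)

# fraction of letters in the population that agree with the target
frac = sum(a == t for agent in pop for a, t in zip(agent, TARGET)) / (POP * N)
print(frac)
```

With the seed fixed the run is deterministic; the qualitative outcome the authors report is that a bit over half of the letters end up locked in correctly with the rest left as question marks.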

Unsurprisingly, Hinton & Nowlan (1987) find that with learning, the system is able to converge to a state with a little over 50% of the letters fixed in the right place, and the rest as question marks. Without learning, the system never finds the high-fitness phenotype in their simulations, and the authors use this as evidence to suggest that learning can qualitatively speed up evolution. Exciting, right? That is what the genetic algorithms and (to a much lesser extent) biological communities thought of the authors’ work, garnering the authors nearly 1000 citations to date. The work inspired much subsequent development at the intersection of learning and evolution, and I am extremely happy for that.

Unfortunately, the paper is also a prime example of the curse of computing, and of using seemingly innocent modeling assumptions to mislead the casual reader. Let’s put on our algorithmic lenses and see why.

To show a qualitative shift, we need to have an idea of the functional form of how quickly we find the optimum. In this setting, it does not make sense to claim a formal qualitative effect by comparing just two specific numbers. Thus, the first step is to generalize the model to agents with n-letter genomes (in the specific case the authors analyzed, n = 20), with a single marked high-fitness vertex in the fitness landscape. If we ignore the point-mutation structure and only look at fitness, then this is the problem of unordered search: if you want to find the single marked vertex using queries (fitness evaluations), then it is a trivial observation that you will need $\Omega(2^n)$ queries, no matter what classical algorithm you use (Grover search can give you $O(\sqrt{2^n})$ on a quantum computer). In other words, there is no efficient algorithm that can find the fitness maximum. Adding the constraint of a specific algorithm, like only point mutations or random setting of some bits, only makes the task more difficult. However, it is not a qualitative increase in difficulty, because a random walk on the point-mutation hypercube (or random setting of question marks) is still expected to find $x^*$ in about $O(n^2 2^n)$ steps, which is qualitatively the same as the $2^n$ of exhaustive search.
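The exponential scaling of unordered search is easy to check empirically. In this sketch (an illustration of the query bound, not of the authors’ model), memoryless random guessing on a flat landscape with one marked string needs on the order of $2^n$ fitness queries:

```python
import random

def queries_to_find(n, rng):
    """Count fitness queries until memoryless random guessing hits the
    single marked n-bit string on an otherwise flat landscape."""
    target = rng.getrandbits(n)
    queries = 0
    while True:
        queries += 1
        if rng.getrandbits(n) == target:
            return queries

rng = random.Random(1)
for n in (8, 10, 12):
    mean = sum(queries_to_find(n, rng) for _ in range(200)) / 200
    print(n, mean)   # the mean grows roughly like 2**n
```

Restricting the guesser to point mutations of its current string changes the constants and polynomial factors, but not the exponential dependence on n.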

So how does Hinton & Nowlan’s (1987) evolutionary and learning algorithm find the peak in just 20 generations? This is where the computational model really misleads us. The choices of 1000 agents in the population, 1000 trials of learning per agent, and 20 letters seem arbitrary and inconsequential, but they are actually meticulously picked to mislead. If the authors had presented an analytic treatment (which is easy to do) instead of a simulation, this would have been obvious. With 1000 agents (or about $2^{n/2}$ in the general case), and a 50% chance of locked bits, there is a relatively high probability of having one agent whose locked bits (which will number 10 on average) are all picked correctly. With 1000 learning trials, there is also a good probability that this agent will stumble on the correct setting for the other 10 bits (there are $2^{10} = 1024$ ways to assign 10 bits). Also, with 1000 agents and 1000 learning trials per agent, the result is $10^6 \approx 2^{20}$ queries to the fitness function per generation. Thus, it is no surprise that when you give the population that many queries, it can find the peak.
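These back-of-the-envelope numbers can be checked directly (my own sanity computation, using only the seeding distribution described above):

```python
n = 20
pop_size = 1000      # roughly 2**(n/2) agents
trials = 1000        # roughly 2**(n/2) learning trials per agent

# Chance that a seeded agent has no letter locked in wrong: each letter is
# '?' with probability 1/2 or the correct fixed value with probability 1/4.
p_viable = (0.5 + 0.25) ** n
print(p_viable * pop_size)        # expected number of viable agents: ~3

# A viable agent with q = 10 question marks hits x* at least once in
# `trials` independent guesses with probability:
q = 10
p_learn = 1.0 - (1.0 - 0.5 ** q) ** trials
print(p_learn)                    # ~0.62

# Total fitness queries per generation versus the size of the search space:
print(pop_size * trials, 2 ** n)  # 10**6 vs 2**20 = 1048576
```

So a typical seeded population contains a few agents that can learn the peak, each of those succeeds with better-than-even odds, and the population as a whole spends about as many fitness queries per generation as the entire search space contains points.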

How would evolution fare by itself? Well, Hinton & Nowlan handicapped evolution in two ways. First, without learning, each agent samples only 1 phenotype per generation; hence, the population samples only 1000 genotypes per generation. You would need about 1000 generations of without-learning evolution to match one generation of with-learning evolution. Apart from that, there would be no difference if we considered asexual reproduction: evolution without learning would converge just as quickly to the optimal phenotype as evolution with learning, in terms of samples of the fitness function. However, this is where their second handicap comes in: by using sexual reproduction, they make it very difficult for an agent to remain at the optimal genotype, because recombination will almost always ruin it. This is an artifact of recombination being a poor strategy for such artificial landscapes, and not some interesting computational property of nature; although we are starting to get a computational theory for the effect of sex on the definition of fitness peaks (Livnat et al., 2008; 2011). If we wanted to introduce a similar handicap for the combined strategy of evolution + learning, then we could do so by introducing two optimal genotypes (instead of just one) that differ from each other in ~10 letters.
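To see how strongly recombination works against staying at the optimum, consider a parent sitting exactly at $x^*$ mated with a typical converged partner whose locked-in bits are correct but with about half the letters still ?s. In this sketch (the partner’s ? pattern is an arbitrary choice of mine), the child remains exactly at the peak only if it inherits every ? position from the perfect parent, which happens with probability $2^{-10}$:

```python
import random

N = 20
TARGET = (1,) * N   # the peak (arbitrary choice of target)

def child(p1, p2, rng):
    # free recombination: each letter taken from a randomly chosen parent
    return tuple(rng.choice(pair) for pair in zip(p1, p2))

rng = random.Random(2)
perfect = TARGET                                             # parent at the peak
partner = tuple(1 if i % 2 == 0 else '?' for i in range(N))  # half still '?'

samples = 100_000
kept = sum(child(perfect, partner, rng) == TARGET for _ in range(samples))
print(kept / samples)   # close to 0.5**10, i.e. about 0.001
```

In other words, even after the population has converged, all but about one in a thousand offspring of a perfect parent slide back off the spike.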

“Surely, the thousand-fold increase must count for something,” you object. This is where their last implicit lie comes in. By plotting generations as their time parameter, instead of number of samples, they hugely misrepresent the model, not just algorithmically but biologically. Why would you expect an individual that gets to try 1000 different strategies, and evaluate their outcomes on its fitness, to reproduce as quickly as an agent that never tries to learn?

To make Hinton & Nowlan’s (1987) result scale, we would need an exponentially large population of about $2^{n/2}$ agents, with each agent living exponentially long for about $2^{n/2}$ learning trials. Although it makes sense to increase the lifetime of an agent alongside its genome, it is hard to justify an exponential relationship. Since we usually think of DNA copying as a linear process, it makes sense for the organism’s lifetime to be lower-bounded by a linear function (i.e. $\Omega(n)$). Hence, I can believe in agents whose lifetime scales as some polynomial in n, but exponential scaling is too Methuselahian for me.

For the number of agents, a case can be made for exponential size. Since reproduction is doubling, it is interesting to imagine a world of unbounded resources where it can continue indefinitely, leading to an exponential population. Given the previous assumptions, this actually results in an associated complexity class that isn’t completely unreasonable — PSPACE. However, an increase in population size increases the power of evolution as much as (if not more than) learning. Under such settings it becomes pretty much impossible to separate the two; for example, even if the agents only had a constant amount of computation (were simple gates), we would still get the same complexity class. Further, it seems to be cheating a little to allow a population so large that the standing variation can account for most of the possible genotypes. If we account for limited resources, then the population will also be bounded by a polynomial in n. In that case, on Hinton & Nowlan’s (1987) landscape, neither evolution alone nor evolution with learning will find the fitness peak of 20. Instead, the population will always be stuck on the endless local-optimum plateau of fitness 1.

In other words, I don’t think that Hinton & Nowlan (1987) established that learning can speed up evolution, although they did give the idea a nice push. However, their paper, and its big uptake by other researchers, highlights how easy it is to be misled by simulations if you don’t compare your results against a sanity-check analytic model. The most innocuous-looking assumptions (like the population size and number of learning trials, in this case) can actually be knife-edge and determine the results. In such cases, a more general treatment can save us, and point the way forward — another example of how cstheory can help biology.

Since biologists don’t typically look at asymptotics and rigorous qualitative distinctions, I don’t think there exist examples of learning speeding up evolution that would satisfy me: ones where without learning convergence to a local optimum is exponentially slow, but with learning it is doable in polynomial time. My results on the difficulty of finding local optima in the NK-model (Kaznatcheev, 2013), unfortunately, do not give an obvious way forward. My results hold for any polynomial-time mechanism, including ones that take advantage of learning; we won’t get a natural separation there even if we allow the more powerful Lamarckian learning. Semi-smooth landscapes might be a better candidate, since there hardness holds only for certain kinds of adaptive walks, but in the current generality I am using (i.e. general AUSOs instead of actual linear polytopes) there is no efficient algorithm like the ellipsoid method to which learning can turn for inspiration. I think the real way to achieve a separation — and one of the things I am working on right now — is to modify Valiant’s (2009) model to handle individual learning instead of ignoring it.

Unfortunately, even this approach and all the other speedups I know (for example: Papaj, 1994; Anderson, 1995; Dopazo et al., 2001; Borenstein et al., 2006; Paenke et al., 2007) rely on learning being more powerful than evolution: evolutionary dynamics are a subset of the possible dynamics allowed to an undying learner. Anything evolution can do, learning can do better: an immortal learner would perform better than an evolving population. This means that we can think of the cup as half-empty: evolution (i.e. mortality) is slowing down learning. I think it would be much more interesting to see cases of synergy, where learning and evolution together are qualitatively faster than either is by itself, even compared to idealized immortal learners. This will rely on more subtle thinking than the smoothing of fitness landscapes, and on a careful analysis of how learning queries the fitness function. In particular, it will be important to question how learning gets access to certain statistical regularities of the environment. However, I do think that both qualitative speed-ups of evolution due to learning and more subtle synergistic effects are possible. Stay tuned!

### References

Anderson, R. W. (1995). Learning and evolution: A quantitative genetics approach. Journal of Theoretical Biology, 175(1): 89-101.

Borenstein, E., Meilijson, I., & Ruppin, E. (2006). The effect of phenotypic plasticity on evolution in multipeaked fitness landscapes. Journal of Evolutionary Biology, 19(5): 1555-1570.

Dopazo, H., Gordon, M. B., Perazzo, R., & Risau-Gusman, S. (2001). A Model for the Interaction of Learning and Evolution. Bulletin of Mathematical Biology, 63(1): 117-134.

Hinton, G. E., & Nowlan, S. J. (1987). How learning can guide evolution. Complex Systems, 1(3): 495-502.

Kaznatcheev, A. (2013). Complexity of evolutionary equilibria in static fitness landscapes. arXiv: 1308.5094v1.

Livnat, A., Papadimitriou, C., Dushoff, J., & Feldman, M. W. (2008). A mixability theory for the role of sex in evolution. Proc. Natl. Acad. Sci. USA 105(50): 19803-19808.

Livnat, A., Papadimitriou, C., & Feldman, M.W. (2011). An analytical contrast between fitness maximization and selection for mixability. Journal of Theoretical Biology 273(1): 232-234.

Paenke, I., Sendhoff, B., & Kawecki, T. J. (2007). Influence of plasticity and learning on evolution under directional selection. The American Naturalist, 170(2): E47-E58.

Papaj, D. R. (1994). Optimizing learning and its effect on evolutionary change in behavior. Behavioral mechanisms in evolutionary ecology, 133-153.

Valiant, L. G. (2009). Evolvability. Journal of the ACM, 56(1): 3.

From the ivory tower of the School of Computer Science and Department of Psychology at McGill University, I marvel at the world through algorithmic lenses. My specific interests are in quantum computing, evolutionary game theory, modern evolutionary synthesis, and theoretical cognitive science. Previously I was at the Institute for Quantum Computing and Department of Combinatorics & Optimization at the University of Waterloo and a visitor to the Centre for Quantum Technologies at the National University of Singapore.

### 5 Responses to Misleading models: “How learning can guide evolution”

Anything evolution can do, learning can do better: an immortal learner would perform better than an evolving population.

Well, individual learners do perform better as long as they’re approaching the closest maximum. They just don’t perform better over arbitrary periods of time, because the ability to “learn” tends to reduce their adaptability once they arrive there.

That is, immortal learners can be ES only in a stable fitness landscape — they start losing their adaptive advantage as soon as they reach a peak. Since any form of computation has an energy cost, the same “dissipative” flow that powered the learning engine on the way up continues to refine the learning structure toward ever more efficient configurations once it arrives.

Of course efficient is just another way to say fragile. Over time the ability to “climb down or across” a fitness gradient in any given dimension of the landscape is increasingly likely to have been lost to optimization, and it becomes increasingly difficult to adopt a slightly-higher-entropy configuration without falling all the way into equilibrium.

I was recently trying to figure out if there was a name for this pathology, so the pattern is fresh in my mind. It does seem to show up in a lot of different types of process-control systems.

Anyway, if Life On Planet Earth is any indication, “intermittently lossily-compressed” learners (genetic transmission being the lossily-compressed configuration of an immortal learner) seem to outperform continuous (aka immortal and low-loss) learners eventually. Whether that has much, or anything, to do with the unnamed pathology I couldn’t say.

In fact I don’t know how one would investigate it other than with some sort of inevitably-biased simulation. :-)

Very much looking forward to the “analysis of how learning queries the fitness functions” part BTW. Thank you for your many excellent posts.

• Thank you for the comment!

Although I am inclined to agree with you that any learning procedure will inherently have a metabolic cost, I think it is more interesting to show when learning is ineffective even when metabolically free. This is what I am concentrating on.

I don’t understand some of your statements, though. For instance:

That is, immortal learners can be ES only in a stable fitness landscape — they start losing their adaptive advantage as soon as they reach a peak

In the second part of the sentence, you are just restating the second phase of the Baldwin effect. A static fitness landscape (I assume that is what you meant by stable) only makes this easier, so I don’t see how the first part of the sentence is a consequence of the second.

I don’t understand the following paragraph at all:

Of course efficient is just another way to say fragile. Over time the ability to “climb down or across” a fitness gradient in any given dimension of the landscape is increasingly likely to have been lost to optimization, and it becomes increasingly difficult to adopt a slightly-higher-entropy configuration without falling all the way into equilibrium.

Could you elaborate on what you meant here? In which sense of the words are you using ‘efficient’ and ‘fragile’? For me, the first word usually means “using few of a restricted resource, usually time, space, or energy” (I usually concentrate on time), and the second means “failing to achieve an approximation to its goal under slight perturbation of the inputs, environment, or microdynamics”. I don’t see how the two notions are the same. In fact, most of the efficient learning algorithms I know are usually very robust, for example.

Thank you for inviting me to think out loud in your comment section! Sorry about the delayed and lengthy reply.

Regarding efficiency being associated with fragility, I certainly didn’t mean to imply that they’re the same notion, but I’m also a little surprised to get pushback on it since it’s a fairly common observation in vivo. Given the definitions you provide, what I’m suggesting is that observed values of “slight” for “slight perturbation” tend to be smaller for systems which are more efficient, not via any direct causal mechanism, but as a side effect of the mechanism by which adaptive systems “seek” more efficient configurations.

Consider for example a trivial model of an organism with a behavioral function f operating on sensory inputs s such that f(s) yields behavior b, plus a “learning” function g which operates on f such that g(f1) yields f2. (I’m going to go ahead and assume that there’s some non-zero metabolic cost associated with computation and sensory processing.)

For any given efficiency-maximizing fitness function a, operating on b, the direct effect is to drive the organism toward an optimal g that yields optimal f.

But there’s also the indirect effect of coupling f() and g() ever more tightly to the particular environmental variations that exist at the optimum at a given time. When s has static or nearly-static components x such that f(x,s) ~ f(s), a more efficient but inherently “near-by” configuration for the system as a whole is to drop consideration of the values in x as inputs and treat them as a set of constants or, ideally, a single constant.

The problem of course is that the fitness function is merely an expression of the “Natural Laws” of whatever landscape we happen to be in, whereas g and f have metabolic costs. Given that a landscape can have arbitrarily high dimensionality, and the possible components of s can be an arbitrarily large set, the system will — in the interest of efficiency — tend to converge on algorithms for f which ignore static components of the environment. As the set of inputs s grows smaller, a minor change in x may affect the value of a(b) such that g() starts to yield values of f which are catastrophically maladaptive.

This is the hopefully not very controversial mechanism that I was referring to when I described efficiency and resilience as tending to vary inversely.

More specifically though, the thing that’s informed my thoughts about this (and one of the reasons I find your “algorithmic lens” perspective so interesting) was the realization that one can think of a genotype as being a “lossily compressed” phenotype.

There are various implications to this. In particular, one that seems relevant to your exploration of inter-relationships between learning and evolution, is that what we ordinarily think of as “evolutionary” adaptations vs “cognitive” adaptations are better thought of as having different degrees of lossiness and continuity than as being in different categories. One or the other may be more suitable over different intervals or landscapes, but in principle they are commutative.

A far more speculative implication derived from all of the above is that this “fragility attractor” effect might, in practice, limit the usefulness of “lossless” or “low-noise” learning over long periods of time, and discourage the evolution of any species which could be considered “immortal learners” for the purposes of this conversation. That’s why I was trying to find a name for the pathology, and I would be interested if hearing what you think of that idea.

2. I’ve recently been reading a lot of related papers and find the case that Geoff Hinton makes here very compelling. He has the skill to distill a problem to its bare essentials and make a strong point. In my humble opinion, it’s not justified to make comments such as “implicit lie” or “meticulously picked to mislead”, as Geoff is clear and up front about these decisions in the paper. Even so, I thank you for making me think critically about this paper, and you make some convincing arguments.

After some thought, I think in some ways you have deliberately chosen to misunderstand the point of the paper by making it something it isn’t. You think it should be a fair comparison between learning and evolution, but this is not the intention at all. The point is that learning gives a huge advantage precisely because it can perform many trials within the lifetime of an organism, whereas evolution only gets a single shot, even when the signal from the learning is reduced to the only possible effect it can have on evolution. The central point is that learning smooths out the decision surface by giving a “you’re getting closer” signal, which is immensely valuable for guiding evolution.

Your comments have inspired me to explore further, and for my own interest I will be implementing Geoff’s method and trying out a range of different parameters to see when the benefit of learning fails to show; perhaps when the likelihood of finding the optimum gets too small, evolution could crucially lose some of the correct switch settings. This will be interesting to see.