Software through the lens of evolutionary biology
October 19, 2013 5 Comments
My preferred job title is ‘theorist’, but that is often too ambiguous in casual and non-academic conversation, so I often settle for ‘computer scientist’. Unfortunately, it seems that the overwhelming majority of people equate computer scientists to programmers or some general ‘tech person’, forgetting M.R. Fellows rallying cry: “Computer science is not about machines, in the same way that astronomy is not about telescopes.” Although — like most theorists — I know how to program, the programming I do is nothing like what (I hear) is in industry. In particular, all of my code is relatively small and with concentration, or maybe a single sheet of paper, I can usually keep the whole thing in my head. In fact, the only time I’ve worked in a large code base was developing extensions for MediaWiki during my first summer of college to be used by some groups at the Canadian Light Source. Combined with the preceeding semester of drawing UML diagrams and writing up req&spec documents, I was convinced that I would never be a software engineer. However, I did learn a valuable lessons: real world projects are big and unwieldy, logistics have to be taken seriously, comments and documentation are your friends, and for a sufficiently large software project there is no single person that knows the whole code.
With that much unknown, it is not surprising that bugs abound. Even back in 2002 software bugs cost the US $59.5 billion annually or 0.6% of the GDP, and I imagine the cost has only gone up. If you count ultrafast extreme events or flash crashes of automated hight-frequency traders as bugs, then some argue that you have part of our recent financial crises to blame on software errors (Johnson et al., 2013). To get a feel for the numerosity, a big project like Mozilla Firefox can easily get 2000 new bugs in a year (see figure at left), and Yet most of these bugs are not particularly difficult, and don’t require major overhauls to fix. Even the most serious failures can be fixed by a 12 year-old, why not let evolution have a go?
Usually on TheEGG, I advocate using the algorithmic lens to view fields like biology and ecology. Stephanie Forrest of University of New Mexico and Santa Fe Institute, is suggesting the dual: lets use ideas from biology and ecology to come to grips with computer science, or at least software engineering. In particular, she has developed with her co-authors GenProg — a system for evolutionary program repair (Weimer, et al., 2009; Le Goues, et al., 2012a,b). She described her work at the recent Stanislaw Ulam Memorial Lecture Series (attention conservation note: this is an hour long lecture, watch it if you find this topic particularly fascinating. However, I will make the post largely self-contained and you can read on without watching the whole video):
Overall, genetic programming has a pretty bad rap, lots of promises, few successes, and inherits little theory from more general genetic algorithms. Part of the disappointed has stemmed from a focus on the programming equivalence of abiogenesis (let’s call it aprogrammagenesis), starting with an empty program and hoping to evolve one that can solve some task that we don’t quiet know how to approach — obviously, since programmers are still employed, wide successes have not been had. This is where we can take our first idea from biology, a key insight of Orr-Gillespie theory (Gillespie, 1991; Orr, 2002) is that the wild type is already highly adapted for its environment. In terms of programs: look at with what programmers wrote, it already performs decently, and look at variations on this ‘wild-type’ programs. This is Forrest and colleagues’ starting point for GenProg.
Le Goues et al. (2012a) define fitness in terms of a set of positive (functionality that should not be lost) and negative (faults to be repaired) test-cases. The more test-cases you satisfy, the higher your fitness. A program is represented as an abstract syntax tree and weighted path. The tree contains the program automatically decomposed into a machine friendly tree format, and the weighted path associates with each statement in the tree a weight based on how often that statement occurs in trace runs of the test cases. In practice, there can be thousands of test cases, or we could even use random testing and thus produce something a-kin to Valiant’s (2009) model of evovability — although supplying a membership oracle is a little more subtle in practice than in theory. But for simplicity, the authors considered only two to six positive test cases and a single negative test case (usually based on vulnerability reports) that is weighed as 10 times more important. This results in only four to twelve possible fitness values, and a holey adaptive landscape — most mutations leave fitness unchanged, and some greatly reduce fitness.
To have any hope of finding the bug, Le Goues et al. (2012a,b) depart from biology and bias mutations to make changes to lines of high weight (visited by many negative executions and few positive ones) more likely. They also only consider line-level changes of either a deletion, an insertion, or a swap/replace. For insertions and swaps, they pick a line uniformly at random from the program to place after (for insertion) or replace (for swap) the mutating line. This mutation mechanism really exploits Orr-Gillespie theory, because it assumes that the high fitness program already has the right ideas somewhere in it to fix the bugs, and they were just omitted by the programmer in the case of the erring path. However, this also means that the process cannot innovate new types of solutions, and so is not directly useful for AI’s dream of aprogrammagenesis. Finally, since Forrest was a PhD student of John Holland, her genetic algorithm have to be sexual and incorporate crossover. Each offspring experiences crossover in the form of taking two execution paths that are used by at least one test case, picking a random point along the paths and swapping all statements after that cut point. Again, this is a cross-over that is biased in a biologically unreasonable way to exploit some domain knowledge about how typical software bugs crop up.
Due to the nature of the neutral landscape, mutations, and recombination, when the genetic algorithm terminates, the resulting fittest mutants usually contain orders of magnitude more changes than necessary to fix the bug. The authors minimize this code by looking at a structured diff with the original program, and checking each new line to see if it changes behavior on some test case. If the line does not affect a test case then it is removed, effectively undoing the neutral meandering that the evolution underwent. This trimmed solution is then presented to a programmer for evaluation. Le Gous et al. (2012b) tested this evolutionary paradigm by finding candidate programs that had at least 50,000 lines of C, 10 viable test cases, and 300 versions in a revision control system, these programs were: fbc, gmp, gzip, libtiff, lighttpd, php, python, wireshark. They ran their evolutionary algorithms on Amazon’s cloud servers for upto 12 hours per bug ($8.80 max spent of computing time), and ended up fixing 55/105 bugs at an average cost of $7.32. I imagine that one would be hard-pressed to find a programmer willing to work that cheap.
The important question for me is not if this works, but why it works. All the results in the previous paragraph involved populations of 40 programs with just 10 generations, an evolutionary leap similar to the speed of evolutionary dynamics in the human immune system’s affinity maturation response to finding a bug in its function. Given how hard finding a local fitness peak is for general fitness landscapes (Kaznatcheev, 2013), what are the special features of the software fitness landscape that make it so friendly to adaptation? Schulte et al. (2013) answer this by specializing the arguments of evolutionary economics to software:
Today’s software arose through fifty years of continued use, appropriation, and refinement by software developers. The tools, design patterns and codes that we have today are those that have proven useful and were robust to software developer’s edits, hacks, and accidents, and those that survived the economic pressures of the marketplace. We hypothesize that these evolutionary pressures have caused software to acquire mutational robustness resembling that of natural systems.
In particular, the authors show that in the associated fitness landscape created by their mutation operators and the test suites, an average of 36.8% and a minimum of 21.2% (across 22 real-world programs involving 150,000 lines of code and 23,151 tests) of mutations are neutral. Since the hardness results in Kaznatcheev (2013) rely on non-neutral landscapes, neutrality could be an answer to the speed and effectiveness of evolutionary bug-fixing, but I don’t think it is the whole story. I think it is worthwhile to look more than one mutation out, and look at the prevalence of epistasis in the software fitness landscape. I think this is the greatest contribution software engineering can make to biology. In evolutionary biology, the notoriously difficulty of getting good data on the structure of empirical fitness landscapes has lead to a disconnect between theory and experiment (Orr, 2005; Kryazhimskiy et al., 2009), why not test our biologist theories against the landscapes observed in the digital ecosystems created by humans? Ecologist are already trying this by applying ecological theories to the detailed data of the financial ecosystem (May et al., 2008), why not do the same for evolutionary biology and use software engineering as an artificial laboratory for studying the structure of empirical fitness landscapes?
Gillespie, J.H. (1991). The causes of molecular evolution. Oxford University Press.
Johnson, N., Zhao, G., Hunsader, E., Qi, H., Johnson, N., Meng, J., & Tivnan, B. (2013). Abrupt rise of new machine ecology beyond human response time. Scientific Reports, 3.
Kaznatcheev, A (2013). Complexity of evolutionary equilibria in static fitness landscapes. arXiv: 1308.5094v1.
Kryazhimskiy, S., Tkacik, G., & Plotkin, J.B. (2009). The dynamics of adaptation on correlated fitness landscapes. Proc. Natl. Acad. Sci. USA 106(44): 18638-18643.
Le Goues, C., Nguyen, T., Forrest, S., & Weimer, W. (2012a). GenProg: A generic method for automatic software repair. Software Engineering, IEEE Transactions on, 38(1), 54-72.
Le Goues, C., Dewey-Vogt, M., Forrest, S., & Weimer, W. (2012b). A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In Software Engineering (ICSE), 2012 34th International Conference on (pp. 3-13). IEEE.
May R.M., Levin S.A., & Sugihara G. (2008). Ecology for bankers. Nature, 451(7181): 893-5.
Orr, H.A. (2002). The population genetics of adaptation: the adaptation of DNA sequences. Evolution 56: 1317-1330.
Orr H.A. (2005). The genetic theory of adaptation: a brief history. Nature Reviews Genetics, 6(2).
Schulte, E., Fry, Z. P., Fast, E., Weimer, W., & Forrest, S. (2013). Software Mutational Robustness. Journ. Genetic Programming and Evolvable Machines. DOI: 10.1007/s10710-013-9195-8
Valiant, L.G. (2009) Evolvability. Journal of the ACM 56(1): 3.
Weimer, W., Nguyen, T., Le Goues, C., & Forrest, S. (2009). Automatically finding patches using genetic programming. In Proceedings of the 31st International Conference on Software Engineering (pp. 364-374). IEEE Computer Society.