NK and block models of fitness landscapes
August 25, 2013 8 Comments
On February 12, 2001, the human genome project released its first formal report. It was a great day for biology, and a wonderful birthday present for Darwin. However, it was also a humbling confrontation with complexity and prompted Steven Jay Gould to write in the New York Times:
[The Human Genome project revealed that] Home sapiens possesses between 30,000 and 40,000 … [only twice as many as the tiny roundworm] … The key to complexity is not more genes, but more combinations and interactions generated by fewer units of code … [their function] cannot be predicted from the separate underlying parts alone.
The one gene to one protein paradigm was overthrown, the 40k genes were simply not enough to generate the over 140k identified proteins in the human body. For evolutionary biology, the new view meant that changing one gene could affect (and would affect, in most cases) many proteins and thus have several different contributions to fitness. Biologists needed a model of fitness landscapes where the genes could interact with each other, allow for varying combinations, and manifest epistasis.
Thankfully, over 10 years earlier, Kauffman and colleagues developed just such a model — the NK model of rugged fitness landscapes (Kauffman & Levin, 1983; Kauffman & Weinberger 1989; Kauffman, 1993). Instead of fitness depending on individual genes, it depended on collections of genes; the linear genome was extended with an extra gene interaction structure. Every locus was linked to some small number of other loci, and its contribution to fitness depended not only its allele but also on the alleles of the loci to which it was linked.
The NK-model is a fitness landscape on . The n loci are arranged in a gene-interaction network where each locus is linked to K other loci and has an associated fitness contribution function . Given a vertex , we define the fitness .
By varying K we can control the amount of epistasis in the landscape. With K = 0 we have a smooth landscape, and for higher K we can treat the fitness contributions as generalizations of the two loci epistasis described in the previous section. The model also provides an upper bound of on the number of gene pairs that have epistatic interactions.
The NK-model is frequently studied through simulation, or statistical mechanics approaches. In a typical biological treatment, the gene-interaction network is assumed to be something simple like a generalized cycle (where is linked to ) or a random K-regular graph. The fitness contributions are usually sampled from some choice of distribution. As such, we can think of biologists as doing average case analysis of the fitness landscapes. Since there is — as we will saw earlier — no empirical or theoretically sound justification for the choice of distributions.
I think we should eliminate the distributions and focus on worst-case analysis. In some ways this is inspires by but also contrary to the Orr-Gillespie theory. Orr-Gillespie eliminate the gene interaction structure and try to minimize the assumptions about the distribution; the computer science approach is to eliminate distributions and focus on the structure: what are the most extreme cases that can happen if we suppose that at most K genes interact, but place no other restrictions?
Weinberger (1996) showed that checking if the global optimum in an $NK$ model is greater than some input value V is NP-complete for . Although this implies that finding a global optimum is difficult, it says nothing about local optima. As such, it has generated little interest among biologists, although it spurred interest as a model in the evolutionary algorithms literature, leading to a refined proof of NP-completeness for (Wright et al., 2000). This is in sharp contrast to random wiring.
If the gene interaction network was wired randomly, the genes had relatively short paths for influencing each other. In a random graph for instance, any gene could ‘signal’ any other gene in about log n steps. The random structure of the interaction networks also allows population to converge quickly to evolutionary equilibria and permits no modularity (beyond what is present in any random structure by chance) in the genome.
When we look at biology, however, especially at the level of things like metabolic pathways, there is a lot of reuse of basic components and computations. We see modularity, and the easiest way to include it in a gene network is to cut it up into non-interacting parts — the block model (Perelson & Macken, 1995). To make a block model, we divide the gene interaction network into m non-interacting parts of size n/m; in other words, if one gene is in one block and another in a second then there is absolute no epistatic interaction between them, even through other genes as intermediaries. For example, in the limit of m = n, we just get one locus per block and no interactions, recovering the smooth landscape model.
For larger block sizes, we can use any model we want to initialize the individual blocks. For instance, we can have each block be created as an instance of the NK-model. Since the blocks don’t interact, the length of any adaptive walk will be just the sum of lengths of adaptive walks carried out in the individual blocks. This can actually be used to create some lower bounds on the length of reasonable adaptive paths through an estimate of modularity. For example, the human genome has around base pairs and $4 \times 10^4$ genes; so there is a quadratic relationship between the two. In our block model, this would mean that it is reasonable to assume that we have roughly blocks, each of genes. This means that no matter how fast (say T steps for a block of size ) an individual block can adapt, the whole genome will still adapt only at a rate of . This means that the fast logarithmic rate adaptations that are sometimes observed within a single gene, are irrelevant for genome wide adaptation where they are slowed down by the block structure. It also explains Orr’s (2006) result for why Orr-Gillespie adaptation is slower in a block structured genome.
This is the fourth of a series of posts on computational considerations of empirical and theoretical fitness landscapes. In the previous post I gave a historic overview of fitness landscapes as mental and mathematical metaphors, highlighted our empirical knowledge of them, and explained the Orr-Gillespie theory for minimizing statistical assumptions.
Kauffman, S.A. (1993). The Origins of Order: Self-Organization and Selection in Evolution. Oxford University Press.
Kauffman S, & Levin S (1987). Towards a general theory of adaptive walks on rugged landscapes. Journal of Theoretical Biology, 128 (1), 11-45 PMID: 3431131
Kauffman, S.A., & Weinberger E.D. (1989) The n-k model of rugged fitness landscapes and its application to maturation of the immune response. Journal of Theoretical Biology, 141:211-245, 1989.
Orr, H.A. (2006). The population genetics of adaptation on correlated fitness landscapes: the block model. Evolution 60(6): 1113-1124.
Perelson, A.S., & Macken, C.A. (1995). Protein evolution of partially correlated landscapes. Proc. Natl. Acad. Sci. USA. 92:9657-9661.
Weinberger, E.D. (1996). NP completeness of Kauffman’s N-k model, a tunably rugged fitness landscape. Santa Fe Institute working paper: 1996-02-003.
Wright, A.H., Thompson, R.K., & Zhang, J. (2000). The computational complexity of N-K fitness functions. Evolutionary Computation, IEEE Transactions on 4(4): 373-379.