# NK and block models of fitness landscapes

On February 12, 2001, the Human Genome Project released its first formal report. It was a great day for biology, and a wonderful birthday present for Darwin. However, it was also a humbling confrontation with complexity, and it prompted Stephen Jay Gould to write in the New York Times:

[The Human Genome Project revealed that] Homo sapiens possesses between 30,000 and 40,000 … [only twice as many as the tiny roundworm] … The key to complexity is not more genes, but more combinations and interactions generated by fewer units of code … [their function] cannot be predicted from the separate underlying parts alone.

The one-gene-to-one-protein paradigm was overthrown: roughly 40,000 genes are simply not enough to account for the more than 140,000 identified proteins in the human body. For evolutionary biology, the new view meant that changing one gene could (and in most cases would) affect many proteins, and thus make several different contributions to fitness. Biologists needed a model of fitness landscapes in which genes could interact with each other, allow for varying combinations, and manifest epistasis.

Thankfully, over 10 years earlier, Kauffman and colleagues had developed just such a model: the NK model of rugged fitness landscapes (Kauffman & Levin, 1987; Kauffman & Weinberger, 1989; Kauffman, 1993). Instead of fitness depending on individual genes, it depended on collections of genes; the linear genome was extended with an extra gene-interaction structure. Every locus was linked to some small number of other loci, and its contribution to fitness depended not only on its own allele but also on the alleles of the loci to which it was linked.

The NK model is a fitness landscape on $\{0,1\}^n$. The $n$ loci are arranged in a gene-interaction network where each locus $x_i$ is linked to $K$ other loci $x^i_1, \ldots, x^i_K$ and has an associated fitness contribution function $f_i: \{0,1\}^{K + 1} \rightarrow \mathbb{R}_+$. Given a vertex $x \in \{0,1\}^n$, we define the fitness $f(x) = \sum_{i = 1}^n f_i(x_i x^i_1 \ldots x^i_K)$.
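To make the definition concrete, here is a minimal sketch in Python of a toy hand-wired instance with $n = 3$ and $K = 1$ (the particular links and contribution values are my own illustrative choices, not from any real dataset):

```python
# Toy NK instance: locus i is linked to locus (i + 1) mod 3.
links = [[1], [2], [0]]

# f_i maps the K + 1 = 2 relevant bits (x_i, x_{i_1}) to a contribution.
tables = [
    {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.6, (1, 1): 0.2},
    {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.9},
    {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.3, (1, 1): 0.7},
]

def fitness(x):
    """f(x) = sum_i f_i(x_i, x_{i_1}, ..., x_{i_K})."""
    return sum(tables[i][(x[i],) + tuple(x[j] for j in links[i])]
               for i in range(len(x)))
```

For the all-zeros vertex this sums 0.1 + 0.5 + 0.2; flipping a single bit changes the contributions of every locus linked to it, which is exactly how epistasis enters the model.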

By varying $K$ we can control the amount of epistasis in the landscape. With $K = 0$ we have a smooth landscape, and for higher $K$ we can treat the fitness contributions as generalizations of the two-locus epistasis described in the previous section. The model also provides an upper bound of $n{K + 1 \choose 2}$ on the number of gene pairs that have epistatic interactions.
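The bound follows because each contribution function $f_i$ spans $K + 1$ loci, so it can induce epistasis between at most ${K + 1 \choose 2}$ pairs of them. A one-function sketch (the function name is my own):

```python
from math import comb

def max_epistatic_pairs(n, k):
    """Upper bound on the number of epistatically interacting gene pairs
    in an NK model: n contribution functions, each spanning K + 1 loci,
    so each can couple at most C(K + 1, 2) pairs of loci."""
    return n * comb(k + 1, 2)
```

With $K = 0$ the bound is 0, matching the smooth landscape; with $n = 10$ and $K = 2$ it gives 30.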

The NK model is frequently studied through simulation or statistical-mechanics approaches. In a typical biological treatment, the gene-interaction network is assumed to be something simple like a generalized cycle (where $x_i$ is linked to $x_{i + 1}, \ldots, x_{i + K}$) or a random $K$-regular graph, and the fitness contributions $f_i$ are sampled from some choice of distribution. As such, we can think of biologists as doing average-case analysis of fitness landscapes, even though, as we saw earlier, there is no empirical or theoretically sound justification for the choice of distributions.
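A simulation study along these lines can be sketched as follows. The random wiring, the i.i.d. uniform contributions on $[0, 1)$, and all the names are illustrative assumptions of mine, not a standard from the literature:

```python
import itertools
import random

def random_nk(n, k, seed=0):
    """Sample an NK instance: random wiring, with fitness contributions
    drawn i.i.d. uniform on [0, 1) (a common but arbitrary choice)."""
    rng = random.Random(seed)
    links = [rng.sample([j for j in range(n) if j != i], k) for i in range(n)]
    tables = [{b: rng.random() for b in itertools.product((0, 1), repeat=k + 1)}
              for _ in range(n)]
    return lambda x: sum(tables[i][(x[i],) + tuple(x[j] for j in links[i])]
                         for i in range(n))

def adaptive_walk_length(f, x):
    """Steps a fittest-mutant adaptive walk takes from x before reaching
    a local peak (a vertex where no single bit-flip improves fitness)."""
    x, steps = list(x), 0
    while True:
        best, best_j = f(tuple(x)), None
        for j in range(len(x)):
            x[j] ^= 1  # try flipping bit j
            if f(tuple(x)) > best:
                best, best_j = f(tuple(x)), j
            x[j] ^= 1  # undo the flip
        if best_j is None:
            return steps
        x[best_j] ^= 1
        steps += 1
```

Averaging `adaptive_walk_length` over many seeds and starting vertices is the kind of average-case measurement a typical simulation paper reports.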

I think we should eliminate the distributions and focus on worst-case analysis. In some ways this is inspired by, but also contrary to, the Orr-Gillespie theory. Orr-Gillespie eliminates the gene-interaction structure and tries to minimize the assumptions about the distribution; the computer science approach is to eliminate distributions and focus on the structure: what are the most extreme cases that can happen if we suppose that each gene interacts with at most $K$ others, but place no other restrictions?

Weinberger (1996) showed that checking whether the global optimum of an NK model is greater than some input value $V$ is NP-complete for $K \geq 3$. Although this implies that finding a global optimum is difficult, it says nothing about local optima. As such, it has generated little interest among biologists, although it spurred interest in the model in the evolutionary algorithms literature, leading to a refined proof of NP-completeness for $K \geq 2$ (Wright et al., 2000). These hardness results rely on adversarially wired gene-interaction networks, in sharp contrast to random wiring.

If the gene-interaction network were wired randomly, genes would have relatively short paths for influencing each other: in a random graph, for instance, any gene can 'signal' any other gene in about $\log n$ steps. The random structure of the interaction network also allows populations to converge quickly to evolutionary equilibria, and it permits no modularity in the genome (beyond what is present in any random structure by chance).

When we look at biology, however, especially at the level of things like metabolic pathways, there is a lot of reuse of basic components and computations. We see modularity, and the easiest way to include it in a gene network is to cut the network up into non-interacting parts: the block model (Perelson & Macken, 1995). To make a block model, we divide the gene-interaction network into $m$ non-interacting blocks of size $n/m$; in other words, if one gene is in one block and another gene is in a second block, then there is absolutely no epistatic interaction between them, not even through other genes as intermediaries. In the limit of $m = n$, we get one locus per block and no interactions, recovering the smooth landscape model.
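A sketch of the construction, with each block built as an independent NK instance (the uniform contributions and all names are illustrative assumptions; the defining feature is that fitness is additive across blocks):

```python
import itertools
import random

def block_fitness(n, m, k, seed=0):
    """Block model: m non-interacting blocks of n/m loci each, every
    block an independent NK instance; genome fitness is the sum of the
    block fitnesses, so no epistasis ever crosses a block boundary."""
    assert n % m == 0
    size = n // m
    kk = min(k, size - 1)  # a block cannot link beyond its own loci
    rng = random.Random(seed)
    blocks = []
    for _ in range(m):
        links = [rng.sample([j for j in range(size) if j != i], kk)
                 for i in range(size)]
        tables = [{b: rng.random()
                   for b in itertools.product((0, 1), repeat=kk + 1)}
                  for _ in range(size)]
        blocks.append((links, tables))

    def f(x):
        total = 0.0
        for b, (links, tables) in enumerate(blocks):
            xs = x[b * size:(b + 1) * size]  # this block's slice of the genome
            total += sum(tables[i][(xs[i],) + tuple(xs[j] for j in links[i])]
                         for i in range(size))
        return total
    return f
```

Because the blocks are independent, mutations in different blocks change fitness independently: their effects add exactly, with no cross terms.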

For larger block sizes, we can use any model we want to initialize the individual blocks; for instance, each block can be created as an instance of the NK model. Since the blocks don't interact, the length of any adaptive walk is just the sum of the lengths of the adaptive walks carried out in the individual blocks. This can actually be used to derive lower bounds on the length of reasonable adaptive paths through an estimate of modularity. For example, the human genome has around $3 \times 10^9$ base pairs and $4 \times 10^4$ genes, so the number of genes is roughly the square root of the number of base pairs. In our block model, this makes it reasonable to assume roughly $\sqrt{n}$ blocks of $\sqrt{n}$ genes each. No matter how fast an individual block can adapt (say $T$ steps for a block of size $\sqrt{n}$), the whole genome will still take on the order of $\sqrt{n}T$ steps to adapt. This means that the fast logarithmic-rate adaptations sometimes observed within a single gene are irrelevant for genome-wide adaptation, where they are slowed down by the block structure. It also explains Orr's (2006) result that Orr-Gillespie adaptation is slower in a block-structured genome.

This is the fourth in a series of posts on computational considerations of empirical and theoretical fitness landscapes. In the previous post I gave a historical overview of fitness landscapes as mental and mathematical metaphors, highlighted our empirical knowledge of them, and explained the Orr-Gillespie theory for minimizing statistical assumptions.

### References

Kauffman, S.A. (1993). The Origins of Order: Self-Organization and Selection in Evolution. Oxford University Press.

Kauffman, S.A., & Levin, S. (1987). Towards a general theory of adaptive walks on rugged landscapes. Journal of Theoretical Biology, 128(1): 11-45.

Kauffman, S.A., & Weinberger, E.D. (1989). The N-k model of rugged fitness landscapes and its application to maturation of the immune response. Journal of Theoretical Biology, 141: 211-245.

Orr, H.A. (2006). The population genetics of adaptation on correlated fitness landscapes: the block model. Evolution 60(6): 1113-1124.

Perelson, A.S., & Macken, C.A. (1995). Protein evolution of partially correlated landscapes. Proc. Natl. Acad. Sci. USA. 92:9657-9661.

Weinberger, E.D. (1996). NP completeness of Kauffman’s N-k model, a tunably rugged fitness landscape. Santa Fe Institute working paper: 1996-02-003.

Wright, A.H., Thompson, R.K., & Zhang, J. (2000). The computational complexity of N-K fitness functions. IEEE Transactions on Evolutionary Computation, 4(4): 373-379.


### 18 Responses to NK and block models of fitness landscapes

1. nmancuso says:

Thank you for taking the time to explain and investigate these topics! This series on fitness has been fascinating. However, most of these models assume that mutation is rare. In the case of RNA viral populations, progeny will w.h.p. differ from their "parent" strain. For this reason, variants are "coupled": selection acts over the entire population rather than over individual viral variants. Will you be discussing any models that take these factors into account?

• Thank you!

I am mostly interested in these models as they relate to concepts of equilibrium, and so I like simple ideas like strong-selection weak-mutation. This is mostly because the previous blog posts were prep for this paper, which I hope to blog about later today or tomorrow.

With very high mutation rates, everything gets too stochastic and spread out in the genetic space to really make a meaningful statement about a point corresponding to equilibrium. As such, I don't know too much about these models. However, if you suggest a particularly good paper then I will take a look and see if it brings some ideas to mind. I enjoy learning about the various models that are studied, but as a non-biologist, it takes me a lot of effort to figure out which models are "standard", "good", or even popular.

There are so many models and metaphors for evolutionary dynamics and adaptation out there that it becomes hard to keep track sometimes. It would be great to have a nice taxonomy of evolutionary metaphors, so that one could quickly check where their model fits in the mind space and how far certain results generalize.

• nmancuso says:

As a computer scientist who is new to this area, I am far from being completely versed in previous results. However, my colleagues on the bio-side frequently cite "Neutral evolution of mutational robustness" by van Nimwegen et al.

Your paper is interesting! The worst-case analysis approach seems to be quite useful. Is it possible to include recombination in these models? I would imagine such a model could drastically alter the worst-case path-length.

I haven’t read the neutral mutation paper, but it seems cute from the abstract; I will be sure to give it a closer look. However, just from the abstract you can see that it was written by physicists, not computer scientists. They could have easily discussed how long it would take their population to converge to the eigenstate, since that mixing time is set by the spectral gap.

Thanks for taking a look at my paper :D. There are three results there, and two of them are not a good model of recombination (although SSWM can be used for sexual populations, mutations are so rare that there is never more than one variant around to recombine with), but the third result (presented first, in section 3) holds for any polynomial-time model. In particular, even for sexual populations with recombination (or other mechanisms we haven’t figured out yet), evolution will not find a local peak in polynomial time on some fitness landscapes.

2. Robert Vander Velde says:

Not sure if this interests you (it comes from the world of responding to the very wrong, and is therefore a bit basic), but I thought your post supplemented some of this response well:

http://pandasthumb.org/archives/2014/10/dembskis-argume.html

This is interesting; thanks for the link, and sorry for the slow response.

I don’t really understand how people can go this route in their conclusions, unless they really want ID to be true. For me, when one points out that certain landscapes require long walks (or have some other awkward property), that is just an observation about fitness landscapes as mental models. We don’t have enough experimental knowledge of empirical landscapes (nor can we ever have, as I occasionally argue, given their exponential sizes), so any such conclusion is just a request for theoretical refinement, not a "proof" of something.

Another issue I have with what I gather of the Dembski & Marks argument is that they allow completely arbitrary landscapes and then search for a peak. In such a setting it is really easy to use query complexity to get a separation result (some discussion here of what I mean by that; although that post is focused on the quantum case, since it was written for a completely different context, it hopefully gives you a flavor of what the probabilistic or deterministic cases would look like; the non-deterministic case of certificate complexity is discussed, again in passing, in a follow-up post). The difference with my complexity results is that I prove them for existing models like the NK model, with the easier-to-achieve (and more relevant, but harder to prove hardness results for) objective of a local fitness peak (or plateau), or for artificial models that would seem very simple to a biologist, like the semi-smooth fitness landscape. Also, note that Chaitin tried to do the opposite of Dembski & Marks by picking smooth fitness landscapes and noting that the convergence of blind guessing is exponentially worse than his version of evolution, which in turn is only quadratically worse than his version of ID.