Programming playground: A whole-cell computational model

Three days ago, Jonathan R. Karr, Jayodita C. Sanghvi and coauthors in Markus W. Covert’s lab published a whole-cell computational model of the life cycle of the human pathogen Mycoplasma genitalium. This is the first model of its kind: they track all biological processes such as DNA replication, RNA transcription and regulation, protein synthesis, metabolism and cell division at the molecular level. To achieve this, the authors integrate 28 different sub-models of the known cellular processes.

Figure 1A from Karr, Sanghvi et al. (2012)

Figure 1A from Karr, Sanghvi et. al (2012): A diagram of the 28 sub-models, colored by category: RNA (green), protein (blue), metabolic (orange), DNA (red). The modules are connected by arrow representing common metabolites (orange), RNA (green), proteins (blue), and DNA (red).

The key technical accomplishment was integrating the 28 modulus into a single model. Each module is based on existing models, but different modules are expressed in different paradigms: ODE, Boolean, probabilistic, and constraint-based. For me, this is the most impressive aspect of the work. Usually, when I look at biology (or psychology), I see a mishmash of models with each expressed in its own language and seemingly incompatible with the others. The authors overcame this by assuming the modulus are independent on short timescales (under 1 second). This allows the software to keep track of 16 global cell variables which are used as inputs for the submodulus that are run to simulate 1 second and their results used to update the global variables and repeat the loop. The whole software is available online and the authors can use the data gathered to produce a video of a single cell’s life cycle:

The authors show that the model has a high level of agreement with existing data. They also use the predictions to run several novel real-biology experiments, and even partially overturn (or complete) a previous experimental observation based on hints from their model. In particular they show that disruption of the IpdA gene — which Glass et al. (2006) suggested as non-essential — has severe (but noncritical) impact on cell growth. I wish I could comment more on the validity of the model as judged by experiments, but molecular biology is magic to me.

The simulation results that were most exciting for me was looking at the effects of single-gene disruptions on phenotype. The bacterium Mycoplasma genitalium is a human urogenital parasite whose genome contains 525 genes (Fraser et al., 1995). It is not an easy model organism to work with, but it has the smallest known genome that can constitute a cell. Part of the team on this project, is from J. Craig Venter Institute and has extensive experience with the organism due to their effort to create the first self-replication synthetic life by implanting artificial DNA into Mycoplasma genitalium. I would not be surprised if this model plays a vital part in the institute’s engineering.

Karr, Sanghvi et al. (2012) ran simulations of each of the 525 possible single-gene disruption strains. They found that 284 genes were essential to sustain growth and division and 117 are non-essential — a 79% agreement with the experimental results of Glass et al. (2006). Of particular interest for me was that in some cases it took more than one generation for specific proteins to fall to lethal levels. As far as I understand this is because when a single-cell divides, daughters get both a copy of the mother DNA and have their initial levels of proteins and RNA set to within statistical fluctuations of those of their mother. Due to my complete lack of basic biological background, this seemed an interesting example of Lamarkian evolution. In particular, it raises questions on how to best combine single-cell learning and evolution. From a naive Bayesian model of learning, it would seem that this would allow cells to pass on their priors — a biological evolution counterpart to Beppu & Griffiths (2009) cultural ratchet.

The detail of the whole-cell model is impressive. I hope that the software becomes a tool for theorists without access to a wet-lab to play around with cells. The approach is an antithesis to the simple and completely unrealistic models I am accustomed to building. For me, it raises many thoughts on how to better think about the distinction between genotype and phenotype that is almost always ignored in evolutionary game theory. For now the whole-cell model is computationally too expensive for me to build evolutionary dynamics from it, but maybe parts of the code can be simplified or ignored or maybe we could use more course-grained models. Either way, I am excited for my new playground!

References

Beppu, A., & Griffiths, T. (2009). Iterated learning and the cultural ratchet. Proceedings of the 31st Annual Conference of the Cognitive Science Society, 2089-2094.

Fraser, C.M., Gocayne, J.D., White, O., Adams, M.D., Clayton, R.A., Fleischmann, R.D., Bult, C.J., Kerlavage, A.R., Sutton, G., Kelley, J.M., et al. (1995). The minimal gene complement of Mycoplasma genitalium. Science 270, 397–403.

Glass, J.I., Assad-Garcia, N., Alperovich, N., Yooseph, S., Lewis, M.R., Maruf, M., Hutchison, C.A., Smith, H.O., & Venter, J.C. (2006). Essential genes of a minimal bacterium. Proc. Natl. Acad. Sci. USA 103, 425–430.

Karr, J.R., Sanghvi, J.C., Macklin, D.N., Gutschow, M.V., Jacobs, J.M., Bolival, B., Assad-Garcia, N., Glass, J.I., & Covert, M.W. (2012). A whole-cell computational model predicts phenotype from genotype Cell, 150, 389-401 DOI: 10.1016/j.cell.2012.05.044

Advertisements

Critique of Chaitin’s algorithmic mutation

Last week, I reviewed Gregory Chaitin’s Proving Darwin. I concentrated on a broad overview of the book and metabiology, and did not touch on the mathematical details; This post will start touching. The goal is to summarize Chaitin’s ideas more rigorously and to address Chaitin’s comment:

I disagree with your characterization of algorithmic mutations as directed.
They are in fact random.
Rgds,
GJC

Specifically, I need to (a) summarize the mathematical details of algorithmic mutations, (b) clarify the distinction between random mutation and (what I called) randomized directed mutation, and (c) defend my assertion that algorithmic mutations are not purely random but directed. I need to fully explain my earlier claim:

The algorithmic mutation actually combine the act of mutating and selecting into one step, it is a n-bit program that takes the current organism A and outputs a new organism B of higher fitness (note, that it needs an oracle call for the Halting-problem to do so). The probability that A is replaced by B is then 2^{-n}. There is no way to decouple the selection step from the mutation step in an algorithmic mutation, although this is not clear without the technical details which I will postpone until a future post. Chaitin’s model does not have random mutations, it has randomized directed mutations.

The high level description of algorithmic mutations given in Chapter 5 is not necessarily inconsistent with random mutations, but it is not detailed enough to make a concrete judgement. We have to look at Appendix 2 (The Heart of the Proof) to really understand how Chaitin’s mutations act. As before, an organism A is a halting self-delimiting program, and A‘s fitness \phi(A) is the integer A outputs upon halting. I will use t(A) to mean the running time of A and will set t(P) = \infty for a non-halting program P.

A mutation M_K works as follows. M_K(A) calculates a lowerbound \Omega_{\phi(A)} = \sum_{A \in S(\phi(A))} \frac{1}{2^{|A|}} of Chaitin’s constant \Omega, where S(N) is the set of programs smaller than N bits in length and halting by N steps. M_K calculates a new number \beta = \Omega_{\phi(A)} + 2^{-K} and mutates A into A' = \pi\beta. Here \pi is a self-delimiting prefix that finds the first N such that \Omega_N \geq \beta, outputs N, and halts. If \beta < \Omega then A' a halts and is thus an organism with fitness \phi(A') = \beta. If \beta \geq \Omega then A' is a non-halting program and thus not an organism (or at best an organism with an undefined fitness \phi(A')). Chaitin’s mutations either return a strictly more fit organism, or do not return one at all (equivalently: an organism with undefined fitness).

To simulate evolution (as opposed to a deterministic picking of mutations), Chaitin selects the mutations randomly (with mutation M_K being selected in proportion to 2^{-|M_K|} ). This is probably why he writes them as ‘random’, however picking a directed mutation at random, is just randomized directed mutation. The concept of a mutant having lower fitness, is undefined in metabiology. A defense would be to say that algorithmic mutations actually combine the steps of mutation and selection into one — this is a technique that is often used in macroevolutionary models. However, I cannot grant this defense, because in macroevolutionary models, even when a fitter agent is always selected, the mutations are still capable of defining an unfit individual. Further, when macroevolutionary models set up such a hard gradient (of always selecting the fitter mutant) then these models are not used to study fitness because it is trivial to show unbounded fitness growth in such models.

Is Chaitin’s model the simplest it can be? Chaitin spends the first four chapters defending algorithms/software as metaphors for genes; Does metabiology actually use the power of algorithms? Chaitin does not attempt to address abiogenesis in his model and the first organism is an arbitrary program. Every subsequent program arises from the original by a sequence of algorithmic mutations; but every mutation returns an program of the form \pi \beta where \beta is some rational number greater than 0 and less than \Omega. In other words, the genes are only capable of programming:

N = 1
while \Omega_N < \beta do N = N + 1 end
print N

I hid the subroutine that computes \Omega_N but we will see that it is a superficial feature. \beta and the N that the algorithm returns are completely equivalent. A larger \beta returns a larger N. Since the prefix \pi is always repeated, I will just represent organisms by the rational number after \pi. Instead of \pi \beta, I will write \beta. Now, I can simple re-define the fitness function:

\phi(\beta) = \begin{cases}   \beta & \mathrm{if} \; \beta < \Omega \\   \bot & \mathrm{otherwise}    \end{cases}

and the mutation operator:

M_K(\beta) = \begin{cases}   \beta + \frac{1}{2^K} & \mathrm{if} \; \beta < \Omega - \frac{1}{2^K} \\  \beta & \mathrm{otherwise}  \end{cases}

and metabiology remains completely unchanged, with the exception of the original organism which we can just set as 0 for our purposes. Of course, instead of the fitness growing higher and higher it grows to be a better and better lower bound for \Omega. All the algorithms are stripped away, except in that \Omega comes from algorithmic information theory. However, we don’t need \Omega to achieve all of Chaitin’s results: we can use any irrational number between 0 and 1 like \frac{1}{\sqrt{2}} (in fact, the number doesn’t even need to be irrational it just has to have a non-terminating expansion in base 2). In other words, the algorithmic part of metabiology is superficial.

Further, by looking at metabiology in the simple terms I described above we can also overcome the directed nature of ‘algorithmic’ mutations. We just modify the mutation operator to be indexed by K and either – or + with both K and – or + selected from some distribution. A natural distribution is to pick K proportional to 2^{-K} and – or + uniformly. Now, we can define a random mutation operator:

M_{K,+}  \begin{cases}   \beta + \frac{1}{2^K} & \mathrm{if} \; \beta < \Omega + \frac{1}{2^K} \\  \beta & \mathrm{otherwise}  \end{cases}

and

M_{K,-}  \begin{cases}   \beta - \frac{1}{2^K} & \mathrm{if} \; \beta \geq \frac{1}{2^K} \\  \beta & \mathrm{otherwise}  \end{cases}

Then given a focal organism \beta and a mutant \beta' we can use our favorite selection rule to chose who survives. A tempting one is to use Chaitin’s hard-max and let whoever is larger survive. Alternatively, we can use a softer rule like \beta' takes over with probability proportional to exp(\beta' - \beta) (this is a popular rule in evolutionary game theory models). Alternatively, we can stick ourselves plainly inside neutral evolution and just say \beta' always takes over regardless of fitness. This would create a random walk model, which can still show an increase in complexity as proposed by Gould (1997) and summarized on this blog by Julian.

References

Chaitin, G. (2009). Evolution of Mutating Software EATCS Bulletin, 97, 157-164

Ethnocentrism in finite inviscid populations

In preparation for EGT Reading Group 31 and as a follow up to my previous post on inviscid ethnocentrism, I decided to look at Martin Nowak’s first paper on tag-based models (Traulsen & Nowak, 2007). For Arne Traulsen, this was the 3rd paper on ethnocentrism. Traulsen’s first publication (Traulsen & Shuster, 2003) provided a simplified analysis of the classic Riolo et al. (2001) and considered linked strategy-tag combinations (where one has to cooperate with the in-group), but showed that the analysis is possible with just two tags and not a continuum. Traulsen & Claussen (2004) considered a model closer to our familiar H&A model and showed the evolution of ethnocentrism in a spatial lattice. Trausen & Nowak (2007) follow Jansen & van Baalen (2006) in building a model in which cheaters (defectors against in-group; selfish agents in our terminology) are possible but no spatial structure is available. They move past Jansen & van Baalen (2006) by considering finite populations in which only two tags need to co-exist at a time.

The authors consider a model with a fixed number N of agents. Each agent has one of K possible tags, and interactions happen only within a tag. The agent can be either a cooperator (ethnocentric in our terminology) or defector (selfish in our terminology), allowing for a total of 2K phenotypes. The payoffs from random interactions are computed according to the standard b-c Prisoner’s dilemma game, with b the benefit of receiving cooperation and c the cost of providing the cooperation. The population is updated by selection two individuals uniformly at random and having the first adopt the strategy of the second (with some probability of mutation) with probability (1 + e^{\beta(\pi_1 - \pi_2)})^{-1} where \pi_1 is the utility of the first individual, and \pi_2 — the second. The parameter \beta corresponds to the strength of selection, and in the analytic model the regime of weak selection (\beta << \frac{1}{N}) is considered.

The probability of mutating strategy while keeping tag fixed is given by u, and the probability of mutating both strategy and tag at the same time is given by v. With this Traulsen & Nowak arrive at their main result, cooperative behavior evolves if:

\frac{b}{c}  > 1 + 2\frac{u}{v}

Note that this cooperative behavior is not permanent, but comes in tides of tolerance. Cooperators of a given tag dominate until invaded by a defector. So the prediction is meant to apply to long-term averages over many evolutionary cycles and not a stable absorbing state.

The authors instantiate their analytic approach in three concrete models. For the first, they consider a binary genotype of length L + 1 where the first bit gives the strategy and the remaining L bits are the tag. They assume that any bit has a mutation probability \mu, but do not allow double mutations in the tag. This gives the condition for cooperation:

\frac{b}{c} > 1 + 2\frac{1 - \mu}{L\mu}

Since their analysis is valid for only small mutation rates \mu, this suggests a high \frac{b}{c}. However, it is not clear to me why the authors didn’t simply allow an arbitrary number of mutations in the tag-coding region. This would give u = \mu (1 - \mu)^L and v = \mu(1 - (1 - \mu)^L) which provides a simpler condition of:

\frac{b}{c} > 2\frac{1}{1 - (1 - \mu)^L} - 1

Which in the limit of large L (or more formally as L \rightarrow \infty) and small (relative to size of genome) mutations \mu = \frac{x}{L} gives:

\frac{b}{c} > 2\frac{1}{1 - e^{-x}} - 1

Which means that cooperation evolves if the mean number of point-mutations per offspring x is greater than \ln(\frac{b + c}{b - c}).

For their second model, Traulsen & Nowak consider the more reasonable case with the parity of the first n bits coding the strategy. The remaining bits, along with L bits of overlap with the first n, code the tag. Here, in the limit of one point-mutation per generation (\mu <<  \frac{1}{n}) they arrive at the requirement for cooperation:

\frac{b}{c} > \frac{2n - L}{L}

Note that this scales to the trivial b/c > 1 for linked strategy-tag genes, thus recreating (in-spirit) the earlier results of Riolo et al. (2001) and Traulsen & Schuster (2003).

Traulsen & Nowak’s last model is of the greatest interest to me, since it resembles mutation in the H&A model. Here the authors consider equal mutation probability between any strategy-tag pair and with K possible tags arrive at the condition:

\frac{b}{c} > \frac{K + 1}{K - 1}

This means we need \frac{b}{c} > 3 in the minimal case of 2 tags, and the requirements relax towards the trivial \frac{b}{c} > 1 as K \rightarrow \infty. In my earlier preliminary results, I concentrated on 4 tags and b/c = 4 which is higher than the 5/3 required from Traulsen & Nowak’s results, and observed both tides of tolerance (periodic fluctuations in amount of cooperative interactions) and absorbing states of all-cooperation or all-defection. I also briefly showed that b/c = 5/4 bifurcates, and this can be even more clearly seen in the following graph of b/c = 6/5 with only two tags:

Results for b/c = 1.2 and t = 2

Proportion of cooperation versus evolutionary cycle of 30 simulations run for 3000 cycles. The benefit-to-cost ration is 1.2 and there are only two tags. Runs that had less than 5% cooperation (24 runs) in the last 500 cycles are traced by red lines; more than 95% cooperation (3 runs) by green; intermittent amounts (3 runs) are yellow. The black line represents the average of all 30 runs, no standard error is shown.

This is well outside the regime where Traulsen & Nowak predict cooperation. The model I use has a simpler selection rule and incorporates free-space in the inviscid population; it is not meant to question Traulsen & Nowak. However, it is nice to know that my preliminary results produce cooperative interactions outside of the known regions of mutation-driven ethnocentrism.

References

Jansen VAA, & van Baalen M (2006). Altruism through beard chromodynamics. Nature 440: 663–666.

Riolo RL, Cohen MD, & Axelrod R (2001). Evolution of cooperation without reciprocity. Nature 414: 441–443.

Traulsen A, & Claussen JC (2004). Similarity-based cooperation and spatial segregation. Phys Rev E 70: 046128.

Traulsen A, & Nowak MA (2007). Chromodynamics of cooperation in finite populations. PLoS One, 2 (3) PMID: 17342204

Traulsen A, & Schuster HG (2003). Minimal model for tag-based cooperation. Phys Rev E 68: 046129.