Since I am again short of a post, I thought I’d share this week a simple proof of a bound possible with these techniques. This is based on an old note I wrote on 19 April 2011.
One of the big conjectures in quantum query complexity (at least half a decade ago, when I was worrying about this topic) is that quantum queries give you at most a quadratic speedup over deterministic queries for total functions. In symbols: $Q(f) = \Omega(\sqrt{D(f)})$, where $D(f)$ and $Q(f)$ are the deterministic and bounded-error quantum query complexities of $f$. Since Grover’s algorithm gives us a quadratic quantum speed-up for some total functions (like OR), this conjecture basically says: you can’t do better than Grover.
In this post, I’ll prove a baby version of this conjecture.
Let’s call a Boolean total function easy to certify if one side of the function has constant certificate complexity. I’ll prove that for easy-to-certify total functions, $Q(f) = \Omega(\sqrt{D(f)})$.
This is not an important result, but I thought it is a cute illustration of standard techniques. And so it doesn’t get lost in my old pdf, I thought I’d finally convert it to a blog post. Think of this as a simple application of the adversary method.
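Consider a non-constant total function $f: \{0,1\}^n \rightarrow \{0,1\}$. Let b be the output that corresponds to the part of the function that is harder to certify (for a definition of certificate complexity, see my old post), and let $\bar{b} = 1 - b$ be the other output. In other words, we will call the bigger certificate complexity $u = C^{(b)}(f)$ and the smaller $\ell = C^{(\bar{b})}(f)$. Thus, we have that $u \geq \ell \geq 1$ (the last inequality follows from the fact that $f$ is non-constant). Now consider an input x such that f(x) = b and $C_x(f) = u$, and let S be a minimal certificate of size |S| = u. Define $S^x$ as the set of all strings x’ such that x’ and x agree on all bits in S. More formally:
$$S^x = \{ x' \in \{0,1\}^n \;|\; \forall j \in S \;\; x'_j = x_j \}$$
Since S is a certificate, we know that $f(S^x) = b$, where we overloaded notation in the obvious way to serve as shorthand for $\forall x' \in S^x \;\; f(x') = b$. Further, since f is total, we know that $|S^x| = 2^{n - u}$.
Let $x^{(i)}$ be x with the i-th bit flipped. Consider an arbitrary $i \in S$. If for all $x' \in S^x$ we have $f(x'^{(i)}) = b$ then i is not necessary for S to be a certificate, and we can remove it, contradicting the fact that we picked a minimal certificate. Thus:
$$\forall i \in S \;\; \exists x' \in S^x \quad f(x'^{(i)}) = \bar{b}$$
Let $Y_i = \{ x'^{(i)} \;|\; x' \in S^x \text{ and } f(x'^{(i)}) = \bar{b} \}$; we just showed that for every $i \in S$, this set is non-empty.
Over all the $y \in Y_i$ consider the one with the smallest minimal certificate. In other words, for every $i \in S$ pick a $y \in Y_i$ such that $C_y(f) \leq C_{y'}(f)$ for all $y' \in Y_i$. From the definition of certificate complexity, we thus know that $C_y(f) \leq \ell$. Let $S_y$ be a minimal certificate for y.
Imagine that $S \cap S_y = \emptyset$; then there exists a $z \in S^x \cap S_y^y$, where $S_y^y$ is the set of strings that agree with y on $S_y$. However, such a z is paradoxical since it is b-certified by S and $\bar{b}$-certified by $S_y$. Thus, $S \cap S_y \neq \emptyset$; in fact, they must overlap on a bit on which x and y differ. Since y differs from x on S only at the i-th bit, we must have $i \in S_y$.
Now, consider the set $T_i = \{ y' \in \{0,1\}^n \;|\; \forall j \in S \cup S_y \;\; y'_j = y_j \}$. We will show that this is a subset of $Y_i$. Since any $y' \in T_i$ agrees with y on $S_y$, we have a $\bar{b}$-certificate for y’. In other words, $f(y') = \bar{b}$. Further, since y’ agrees with y on S, and y differs from x on S only at the i-th bit, we have that $y'^{(i)} \in S^x$. Putting the two together, we prove the claim $T_i \subseteq Y_i$.
Now we can do a simple calculation to lower bound the size of $Y_i$, using $|S \cup S_y| = |S| + |S_y| - |S \cap S_y| \leq u + \ell - 1$:
$$|Y_i| \geq |T_i| = 2^{n - |S \cup S_y|} \geq 2^{n - u - \ell + 1}$$
Further, notice that for each $y'' \in Y_i$ there exists an $x'' \in S^x$ such that $y'' = x''^{(i)}$ (i.e. they differ only on the i-th bit). Consider a bipartite graph with the left partition being $S^x$ and the right partition being the union of the $Y_i$. Add an edge between $x'' \in S^x$ and $y'' \in \bigcup_{i \in S} Y_i$ if x” and y” differ by one bit. We already observed that for each y” there is an edge to $S^x$, and the $Y_i$ are disjoint for distinct i (a string in $Y_i$ disagrees with x on S at exactly the i-th bit), thus the total number of edges to $S^x$ is at least:
$$\sum_{i \in S} |Y_i| \geq u \, 2^{n - u - \ell + 1}$$
From this, we can conclude that the average degree of a vertex in $S^x$ is at least $\frac{u \, 2^{n - u - \ell + 1}}{2^{n - u}} = \frac{u}{2^{\ell - 1}}$.
In particular there is some vertex $x^* \in S^x$ such that the size of its neighbourhood (which is equal to its degree) satisfies $|N(x^*)| \geq \frac{u}{2^{\ell - 1}}$. Further for each $y \in N(x^*)$ we have $f(y) \neq f(x^*)$ and each differs from $x^*$ by exactly one bit. In other words, we have shown that the sensitivity $s(f) \geq \frac{u}{2^{\ell - 1}}$.
Using either Ambainis’ method or the polynomial method, it is not hard to show that $Q(f) = \Omega(\sqrt{s(f)})$, thus:
$$Q(f) = \Omega\left(\sqrt{\frac{u}{2^{\ell - 1}}}\right)$$
For constant $\ell$, together with the standard bound $D(f) \leq C^{(0)}(f) \, C^{(1)}(f) = u\ell$, it gives us what we desire: $Q(f) = \Omega(\sqrt{D(f)})$ for total functions with one of its certificates of constant size.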
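To make the quantities in the proof above concrete, here is a minimal brute-force sketch in Python. It is not from the original note: the function names are hypothetical, and OR on 4 bits is an assumed example of an easy-to-certify function (a single 1 certifies the output 1). It computes the sensitivity and both one-sided certificate complexities by exhaustive search, so it is only usable for very small n.

```python
from itertools import combinations, product

def flip(x, i):
    """Return x with the i-th bit flipped."""
    return x[:i] + (1 - x[i],) + x[i + 1:]

def sensitivity(f, n):
    """Max over inputs x of the number of single-bit flips that change f(x)."""
    return max(
        sum(f(x) != f(flip(x, i)) for i in range(n))
        for x in product((0, 1), repeat=n)
    )

def certificate_size(f, n, x):
    """Size of a smallest set S such that every input agreeing with x on S
    has the same output as x."""
    inputs = list(product((0, 1), repeat=n))
    for k in range(n + 1):
        for S in combinations(range(n), k):
            agree = (z for z in inputs if all(z[i] == x[i] for i in S))
            if all(f(z) == f(x) for z in agree):
                return k
    return n

def certificate_complexity(f, n, b):
    """Largest certificate size needed over all inputs with output b."""
    return max(
        certificate_size(f, n, x)
        for x in product((0, 1), repeat=n)
        if f(x) == b
    )

def OR(x):
    # Assumed example function: easy to certify on the 1 side.
    return int(any(x))

n = 4
s = sensitivity(OR, n)
c0 = certificate_complexity(OR, n, 0)  # harder side: all n zeros must be seen
c1 = certificate_complexity(OR, n, 1)  # easy side: a single 1 suffices
print(s, c0, c1)
```

For OR the easy side has a certificate of size 1 and the hard side needs all n bits, and the sensitivity comes out as large as the bigger certificate complexity, which is the kind of relationship the argument above establishes in general.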
Roughly, in easy landscapes, we can find local peaks quickly and in hard ones, we cannot. But this is very vague. To be a little more precise, I have to borrow the notion of orders of growth from the asymptotic analysis standard in computer science. A family of landscapes indexed by a size n (usually corresponding to the number of genes in the landscape) is easy if a local fitness optimum can be found in the landscapes in time polynomial in n and hard otherwise. In the case of hard landscapes, we can’t guarantee to find a local fitness peak and thus can sometimes reason from a state of perpetual maladaptive disequilibrium.
In Kaznatcheev (2019), I introduced this distinction to biology. Since hard landscapes have more interesting properties, which are more challenging to theoretical biologists’ intuitions, I focused more on them. This was read (perhaps rightly) as me advocating for the existence or ubiquity of hard landscapes, with the implication that if hard landscapes don’t occur in nature then my distinction is pointless. But I don’t think this is the most useful reading.
It certainly would be fun if hard landscapes were a feature of nature since they give us a new way to approach certain puzzles like the maintenance of cooperation, the evolution of costly learning, or open-ended evolution. But this is an empirical question. What isn’t a question is that hard landscapes are a feature of our mental and mathematical models of evolution. As such, all (or most, whatever that means) fitness landscapes being easy is still exciting for me. It means that the easy vs hard distinction can push us to refine our mental models such that if only easy landscapes occur in nature then our models should only be able to express easy landscapes.
In other words, using computational complexity to build upper-bounds arguments (that on certain classes of landscapes, local optima can be found efficiently) can be just as fun as lower-bounds arguments (that on certain classes of landscapes, evolution requires at least a super-polynomial effort to find any local fitness peak). However, apart from a brief mention of smooth landscapes, I did not stress the upper-bounds in Kaznatcheev (2019).
Now, together with David Cohen and Peter Jeavons, I’ve taken this next step — at least in the cstheory context, we still need to write on the biology. So in this post, I want to talk briefly about a biological framing of Kaznatcheev, Cohen & Jeavons (2019) and the kind of fitness landscapes that are easy for evolution.
First, I need to address one more ambiguity in the definition of easy vs hard landscapes. This one is more of a feature than a bug. And it is the ambiguity over “can be found”. In particular, what class of algorithms are we considering? In Kaznatcheev (2019), I use this ambiguity to consider three progressively more restrictive views of hard landscapes. When talking about semismooth fitness landscapes, I consider a family of landscapes hard if random fitter mutant or fittest mutant strong-selection weak-mutation (SSWM) dynamics cannot find the peak in polynomial time. This is a notion of hard for a particular (although popular) evolutionary model. For the more rugged notion of landscapes, I consider two broader classes. First is any SSWM dynamics (i.e. any dynamic that is localized to a single point and moves only uphill) and second is any polynomial time algorithm (so any evolutionary dynamic, even ones that are distributed over the genotype space, go through valleys, jump around, etc.). Obviously, later versions of hardness imply earlier ones, but earlier ones do not imply later ones.
A similar issue can arise with upper-bounds. In the cstheory literature, if we want to establish a complexity upper bound, we just need to find some algorithm that solves the problem efficiently. This isn’t compelling to a biologist. More importantly, it isn’t compelling to me, since it throws away the abstraction over unknown evolutionary dynamics that I wanted to achieve with algorithmic biology. But being easy for any algorithm is impossible: just consider the algorithm that doesn’t do anything or goes strictly downhill instead of up.
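As a rough illustration of the two SSWM rules named above, here is a small Python sketch. Everything in it is an assumption for illustration: the helper names, the representation of genotypes as tuples of 0/1 alleles, and the toy three-locus fitness function are hypothetical and do not come from Kaznatcheev (2019).

```python
import random

def fitter_neighbours(x, fitness):
    """All one-mutant neighbours of genotype x with strictly higher fitness."""
    out = []
    for i in range(len(x)):
        y = x[:i] + (1 - x[i],) + x[i + 1:]
        if fitness(y) > fitness(x):
            out.append(y)
    return out

def sswm_walk(x, fitness, rule, rng):
    """Follow an SSWM dynamic from x until a local peak; return (peak, steps)."""
    steps = 0
    while True:
        options = fitter_neighbours(x, fitness)
        if not options:
            return x, steps  # no fitter mutant: x is a local peak
        if rule == "fittest":
            x = max(options, key=fitness)  # fittest mutant
        elif rule == "random_fitter":
            x = rng.choice(options)  # uniformly random fitter mutant
        else:
            raise ValueError(f"unknown rule: {rule}")
        steps += 1

def toy_fitness(x):
    # Hypothetical landscape with two local peaks: (1, 1, 0) and (0, 1, 1).
    return 3 * x[0] + x[1] + 2 * x[2] - 5 * x[0] * x[2]

rng = random.Random(1)
start = (0, 0, 0)
fittest_result = sswm_walk(start, toy_fitness, "fittest", rng)
random_result = sswm_walk(start, toy_fitness, "random_fitter", rng)
print(fittest_result, random_result)
```

Both rules are localized to a single genotype and only move uphill; they differ only in which fitter neighbour they pick, so on rugged landscapes they can end at different local peaks.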
As such, we need a reasonable class of evolutionary dynamics as the standard for what makes a landscape easy. In the case of Kaznatcheev, Cohen, and Jeavons (2019), we went for a combinatorial feature of landscapes: longest paths. We said a family of landscapes is easy if all adaptive paths are at most polynomial length. In terms of dynamics, this means that any local adaptive dynamic — i.e. a dynamic that goes strictly uphill in the fitness graph — no matter how silly, will find a local fitness peak in polynomial time.
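The longest-path criterion can be checked by brute force on tiny landscapes. The sketch below is a hypothetical illustration (the function names and both example landscapes are assumptions, not taken from the paper): since adaptive steps strictly increase fitness, the fitness graph is acyclic and the longest adaptive path can be found with a memoized depth-first search.

```python
from functools import lru_cache
from itertools import product

def longest_adaptive_path(fitness, n):
    """Number of steps in the longest strictly uphill path over n biallelic loci."""
    @lru_cache(maxsize=None)
    def longest_from(x):
        best = 0
        for i in range(n):
            y = x[:i] + (1 - x[i],) + x[i + 1:]
            if fitness(y) > fitness(x):
                best = max(best, 1 + longest_from(y))
        return best
    return max(longest_from(x) for x in product((0, 1), repeat=n))

def additive(x):
    # Assumed smooth example: fitness is the number of 1 alleles.
    return sum(x)

# Assumed rugged example: fitness increases along a Gray-code ordering,
# so one adaptive path visits every genotype.
gray_order = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0),
              (1, 1, 0), (1, 1, 1), (1, 0, 1), (1, 0, 0)]
gray_rank = {x: r for r, x in enumerate(gray_order)}

def gray(x):
    return gray_rank[x]

smooth_length = longest_adaptive_path(additive, 3)
rugged_length = longest_adaptive_path(gray, 3)
print(smooth_length, rugged_length)
```

On three loci the additive landscape has longest adaptive path 3, while the Gray-code landscape reaches 7, the maximum possible on 8 genotypes. A family is easy in the sense above when this number grows only polynomially with the number of loci.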
This is the definition under which smooth landscapes are easy. In a smooth landscape, there is a single peak. Given any point in genotype space, if that point is Hamming-distance m from the peak — i.e. differs from the peak by m mutations — then any adaptive path from that point has length at most m. And since m is always less than or equal to the number of loci n, this means all smooth fitness landscapes are easy.
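The argument for smooth landscapes can be sanity-checked numerically. The sketch below assumes a randomly drawn additive landscape as one example of a smooth landscape (the effect sizes, seed, and helper names are hypothetical) and verifies exhaustively that every adaptive step moves exactly one mutation closer to the single peak, which is why no adaptive path can be longer than the starting Hamming distance.

```python
import random
from itertools import product

rng = random.Random(0)
n = 5
# Hypothetical additive landscape: each locus has an independent non-zero effect.
effects = [rng.choice([-1, 1]) * rng.uniform(0.1, 1.0) for _ in range(n)]

def fitness(x):
    return sum(e * xi for e, xi in zip(effects, x))

# The unique peak carries allele 1 exactly where the effect is positive.
peak = tuple(1 if e > 0 else 0 for e in effects)

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def every_adaptive_step_approaches_peak():
    """Check that each fitter one-mutant neighbour is one step closer to the peak."""
    for x in product((0, 1), repeat=n):
        for i in range(n):
            y = x[:i] + (1 - x[i],) + x[i + 1:]
            if fitness(y) > fitness(x) and hamming(y, peak) != hamming(x, peak) - 1:
                return False
    return True

smooth_ok = every_adaptive_step_approaches_peak()
print(smooth_ok)
```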
Our goal was to find a richer class of easy fitness landscapes.
For this, we needed a compact way of representing fitness landscapes. So we spent the first half of the paper defining constraint graphs and equivalence classes of constraint graphs. In biological terminology, these are gene-interaction networks.
By using the results of section 4 of Kaznatcheev, Cohen & Jeavons (2019), we can define a gene-interaction network as follows. Given two loci i and j with alleles in D, we will say there is an interaction between them if there is some assignment y to the other n – 2 loci such that the sublandscape has at least sign epistasis. If no such y exists then we say that i and j don’t interact. We then have the gene-interaction network as a graph where the vertices are loci and edges are drawn between any two loci that interact. See section 3 and 4 for how this definition relates to the generalized NK-model.
In this representation, a smooth landscape — which is defined as having no sign-epistasis — would correspond to a gene-interaction network with no edges.
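For a small biallelic landscape given as an explicit fitness function, the definition above can be sketched directly. This is a brute-force illustration under stated assumptions: the function names and toy landscapes are hypothetical, sign epistasis is tested by comparing the sign of a mutation’s fitness effect across the two alleles of the other locus, and neutral (zero-effect) mutations are handled naively.

```python
from itertools import combinations, product

def sign(v):
    return (v > 0) - (v < 0)

def has_sign_epistasis(f00, f01, f10, f11):
    """fab is the fitness with allele a at the first locus and b at the second."""
    first_locus = sign(f10 - f00) != sign(f11 - f01)
    second_locus = sign(f01 - f00) != sign(f11 - f10)
    return first_locus or second_locus

def interacts(fitness, n, i, j):
    """True if some assignment to the other n - 2 loci shows sign epistasis."""
    others = [k for k in range(n) if k != i and k != j]
    for background in product((0, 1), repeat=len(others)):
        def f(a, b):
            x = [0] * n
            for k, v in zip(others, background):
                x[k] = v
            x[i], x[j] = a, b
            return fitness(tuple(x))
        if has_sign_epistasis(f(0, 0), f(0, 1), f(1, 0), f(1, 1)):
            return True
    return False

def interaction_network(fitness, n):
    """Edge set of the gene-interaction network over loci 0..n-1."""
    return {(i, j) for i, j in combinations(range(n), 2)
            if interacts(fitness, n, i, j)}

def additive(x):
    # Assumed smooth example with no epistasis at all.
    return sum(x)

def epistatic(x):
    # Assumed example: loci 0 and 2 interact, locus 1 is independent.
    return 3 * x[0] + x[1] + 2 * x[2] - 5 * x[0] * x[2]

smooth_edges = interaction_network(additive, 3)
epistatic_edges = interaction_network(epistatic, 3)
print(smooth_edges, epistatic_edges)
```

The additive landscape yields a network with no edges, matching the description of smooth landscapes, while the second toy landscape yields the single edge between loci 0 and 2. The search is exponential in the number of loci, so this is only a definition check and not a practical algorithm.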
In section 5, we establish with Theorem 6 that in a biallelic system (i.e. D = {0,1}) if the gene-interaction network is a tree (or forest) then it corresponds to an easy fitness landscape. More precisely, we show that the longest adaptive path in a fitness landscape corresponding to any gene-interaction tree on m loci has length at most .
The proof is cute and I would encourage you to read it. Pun intended — although only clear after reading Definition 12.
We also establish that this result is tight in various ways. For example, there is indeed a biallelic gene-interaction tree (path, in fact) that produces a fitness landscape with an adaptive path of length . If we move from biallelic to triallelic landscapes then the result disappears: there is a triallelic gene-interaction tree (path, in fact) that produces a fitness landscape with an exponentially long adaptive path. Similarly, for biallelic systems, if we move from trees to gene-interaction networks with treewidth two then there is again a fitness landscape with exponentially long adaptive paths.
There are a number of important caveats here.
First, just because the particular triallelic or biallelic treewidth two landscapes aren’t easy doesn’t mean they are hard. In particular, they produce long walks for silly rules like reluctant adaptive dynamics (i.e. moving to the fitter mutant that has the smallest increase in fitness — least-fit SSWM).
Second, it doesn’t mean there aren’t other classes of easy landscapes that don’t have tree-like gene-interaction networks. An obvious such class is block-models with constant-sized blocks.
But this can give us at least a bit more intuition on what kind of fitness landscapes are easy and why.
Kaznatcheev, A. (2019). Computational complexity as an ultimate constraint on evolution. Genetics, 212(1): 245.
Kaznatcheev, A., Cohen, D.A., & Jeavons, P.G. (2019). Representing fitness landscapes by valued constraints to understand the complexity of local search. 25th International Conference on Principles and Practice of Constraint Programming. arXiv:1907.01218.
Recently, David Basanta stumbled across an old (19 March) twitter thread by Dan Quintana on why people should use such twitter threads, instead of blog posts, to announce their papers. Given my passion for blogging, I think that David expected me to defend blogs against this assault. But instead of siding with David, I sided with Dan Quintana.
If you are going to be ‘announcing’ a paper via a thread then I think you should use a twitter thread, not a blog. At least, that is what I will try to stick to on TheEGG.
Yesterday, David wrote a blog post to elaborate on his position. So I thought that I would follow suit and write one to elaborate mine. Unlike David’s blog, TheEGG has comments — so I encourage you, dear reader, to use those to disagree with me.
First, it might be useful to learn more about how the best of these twitter threads are written. For this, Lynn Chiu has a nice post — that reads more like a twitter thread itself — on why she loves reading twitter threads on papers. Her ten features of a good paper thread are:
Lynn Chiu’s reason for highlighting these features is rather subversive and very exciting to me. These 10 points show good scientific method, something that standard academic publishing has become too stifling to allow. For her, they embody Stuart Firestein’s philosophy of ignorance and failure as the engines of science. Mainstream publishing and the culture around it is not capable of reflecting these important drivers of science so we have to turn to new media like twitter.
This is something that sits very well with my own views of failure and falsehood as primary and capital-T Truth as stifling. And why I like to develop many of my ideas on the blog. Why I use the blog as a discussion platform that is not possible through papers, and not as an extension or advertisement for papers. Why I think of the blog as a tool to do science rather than share science.
But this is not how many people read Lynn Chiu’s 10 points.
Instead, many seem to have read it as a how-to guide for science communication. Or, even more sinister, as a way to optimize our alt-metrics and advertise for our papers. A way to do free marketing work for scientific publishers. If we already write, review, curate, and edit for free — why not also do our own free marketing? After all, this might be that rain dance that gives us that slight edge in an insecure academic market.
And if we come at paper ‘advertising’ from this perspective then why not? Let’s advertise on both twitter and blogs. Why not throw in Reddit? Let’s optimize our conversion rate and eyeballs reached. Let’s completely lose sight of Lynn Chiu’s radical message and instead along with being good scientists become great marketers in mixed media. Write our papers in one voice, our blogs in a second, and our twitters in the vulnerable voice of a parasocial relationship with our adoring followers.
I don’t want to accuse David of embracing the above view. David has an earnest commitment to science communication that is driven by a passion for sharing the adventure of science. As far as I know David, I know that he is not driven by these careerist calculations.
But sometimes I am.
And I don’t think I am alone.
I see lots of people doing ‘science communication’ just because we think it will be that extra edge that makes our paper noticed in the information overload. That one tick that will land us the next step in an extremely shaky career ladder.
This perspective tends to treat ‘the public’ and other scientists as consumers. Maybe semi-active ones that can retweet, share and like. But still, an ‘other’ that we try to trick with nice rhetorical flourish (“be vulnerable and tell a good story”) into appreciating our work.
From this perspective, advertising a paper feels like a way to talk at my audience. Rather than a way to listen.
And when I fall into this cynical perspective, I feel like I am not doing good. Like I am serving the interests of myself or my class above those of the people.
And I don’t want to do this.
So this is why I was opposed to using blogs for paper advertising in my twitter discussion with David.
On TheEGG, I feel like I am still ‘doing science in the open’. I feel like I am using the process of writing for this blog as a way to develop new ideas. I feel like the feedback I get from readers (either directly through the blog comments or indirectly through popularity and discussions elsewhere) filters into my research.
Of course, given that the research I do often ends up in papers, there is a lot of content overlap between what I write here and what I write in more mainstream academic venues. And I will eagerly reference or link to my published work if it is relevant. But I try not to write here just for the sake of promoting my more careerist objectives. I try to write for TheEGG for the sake of TheEGG. And for the sake of developing ideas.
As such, I don’t want to use my blogging to advertise my papers.
I also want to encourage more bloggers to follow a similar path. I want other scientists to view blogging as not subordinate to — or worse yet, a distraction from — mainstream publishing, but as a legitimate — and often more effective — means of discussion and development of science. Hence, I want others to also prioritize discussion over advertising.
That said, no matter how heretical my views may seem at times, I’m still a professional academic. I still take pride in a paper that has finally ‘made it past the finish line’. And I still want to share that excitement with my friends and colleagues. So I will continue to tweet about papers and even make the occasional thread.
In this regard, I liked a particular point that Dan Quintana raised in his argument for twitter threads (over blog posts):
By dividing a paper into a series of 280 character bite-sized ideas, we can test which specific points in our work resonate with people the most. This can be a way to listen to the public. An extra (even if minor) line of evidence for determining which future directions to pursue. If a particular point draws the most likes/retweets/comments in a thread then maybe that is a point that is worth elaborating on in the future.
In practice, this is a hard signal to interpret. At least in the case of tweets like mine that only generate a modest amount of engagement.
But hard is better than impossible. Such fine-grained division is not possible with blogs.
What about the points that David makes in favor of blogs? His four advantages of blogs are:
I am not convinced.
Just because web access is more widespread than twitter, doesn’t mean that a blog post is read more than a tweet. In the case of TheEGG, it might be the case that the blog has a bigger following than my twitter. But there are plenty of tweeps who have a much larger following on twitter than on their blogs. For example, I would be surprised if David gets more readers with his blog than his twitter, but he can correct me. And for engagement, I think it is even more biased towards twitter. Even on TheEGG, more feedback comes through twitter than direct comments on the blog (and even more feedback is on Reddit). Finally, to read a twitter thread, one doesn’t need an account and can follow any link to the thread as they would with a blog post.
Twitter does suffer from searchability issues. But the transience disadvantage seems exaggerated. Again, picking on David as an example: his blog has migrated through several platforms and urls in the past years. I think that his twitter account has been the same.
More generally, there are many dead blogs on the internet. Many from people who continue to be active on other forms of social media.
It is true that blogs are easier to cite, especially in an academic setting. So I do think new ideas should be developed or at least fleshed-out on blogs. But it will be a great problem to have when academics are thinking about how to cite blogs vs. twitter. Right now, it seems that most simply don’t cite either. Even if one was directly responsible for their ideas.
Finally, the monoculture and the corporation. As I’ve written before, I think that software monoculture is a huge problem. It is a big issue for the persistence and robustness of communication systems. But I don’t think advertisement threads need to be all that persistent. For example, I was very active on G+ and I am sad to see it go. I wrote posts there that are much more involved than any tweet I’ve written. But I seldom feel like I’ve missed much by their disappearance.
As for papers specifically: I think that we greatly over-estimate their half-life. There are a few papers that we read decades later, but I doubt that most papers are read by more people than their authors, reviewers, and editor. And even then not always completely by all. It makes sense to identify and preserve exceptional work, but trying to preserve everything seems about as useful as the Library of Babel. This goes double for the advertisements of those papers.
In the end, though, David’s closing points are important: the more the merrier. For example, just to get David’s attention I will have to tweet this blog post. More seriously: good science communication and discussion is possible in any medium. Why even focus on just twitter and blogs? There are also podcasts, videos, pub chats, and reading groups.
What matters is not so much the medium of our engagement. But that we are after earnest communication and discussion. At the level of papers, blogs, and tweets alike, we need to aim not to promote our accomplishments or advertise our results, but to open dialogues. If papers are failing us in fostering discussion then we should embrace Lynn Chiu’s radical message and turn to twitter threads, or blogs, or op-eds, or podcasts, or preprints. Or reform the culture of academic publishing.
Don’t advertise your paper with a blog post. Don’t advertise your paper with a tweet. Don’t advertise your paper. Instead, open your work for discussion. And listen.
We need to first have a good framework for describing and summarizing phenomena before we set out to build theories within that framework for predicting phenomena.
In this brief post, I want to ask if evolutionary games in oncology are ready for building predictive models. Or if they are still in need of establishing themselves as a good descriptive framework.
To some extent, it is silly to even consider that EGT is not a good descriptive framework in oncology. Mostly because it is already being used for this purpose! There are many many papers — some of them even by me — that use the language of evolutionary game theory to reason about problems in cancer.
But these are largely descriptions not of empirical aspects of cancer but rather a formalization of our mental models. In other words, evolutionary games have established themselves for describing theoretical puzzles in oncology. And sometimes even for resolving those theoretical puzzles.
But, to a large extent, this process is data-free or data-light. When some sort of data is used, it is often as an illustration. A way to motivate or justify the model. It is often collected besides the model — not through the model. In this way, most work seems to operate on parallel tracks of theory and experiment.
My feeling is that there is not much work already out there to use evolutionary games to describe experiments or to make empirical observations directly in oncology. So in the context of experimental oncology, I don’t think that EGT has a proven track record as a useful descriptive framework, yet.
In particular, I believe that mathematical oncology, or at least the part of it focused on evolutionary forces within the patient, is in a position where we have not found the right vocabulary, grammar, and framework for our basic terms for engaging directly with experiment more generally. We do use a language, but much of it is borrowed from physics with justifications like (1) the absorption of physicists into mathematical oncology, (2) the past success of reductionist languages in other disciplines, and (3) its coherence with our physical intuition. But all these justifications feel external to oncology, like this language has not been built up for its own purposes. It has simply been borrowed. We are trying to use the language of physics to make sense of cancer. And sometimes it can be awkward.
If we wanted to better justify physics-speak for talking about experiments in cancer: I think that it is important to consider seriously an alternative choice of X-speak. Then we can contrast physics-speak with X-speak to show that physics-speak is in some way better. Or maybe we will learn something else from X-speak altogether.
Unsurprisingly, evolutionary game theory is the alternative language that I focus on. With the project of operationalization being the construction of a dictionary between the basic terms of EGT and the basic terms of the more established language of experimental practice. This is why I obsess so much about seemingly pedantic distinctions between token vs type fitness and reductive vs effective games.
As in any effort of translation, new basic terms might need to be invented on both sides. This is the feedback between theory and experiment.
In such a setting, early interaction with experiment becomes an exercise in description. We use our tentative dictionary to see what the experimental story reads like in the theoretical language. This is why the game assay that we developed in Kaznatcheev et al. (2019) feels so descriptive. In some sense, it is just summarizing a large dataset as a single point in game space (alongside some error propagation for error-bars).
If the fit is natural — for example, the resulting gain functions are relatively simple, and best represented in terms of proportion rather than density — then that is some evidence for a good framework. If the fit is awkward — for example, the resulting gain functions all have terms of to cancel out the replicator dynamics — then that is strong evidence for a bad framework. The further hope is that if the fit is natural then some empirical regularity will emerge across stories — for example, maybe most signalling settings are well described by quadratic gain functions — then these regularities can be transformed into theories — which we can aim to falsify with future experiments — within the framework.
Or maybe an awkwardness in the fit between EGT and experiment will make us rethink some of our work on EGT and mental models. For example, when less studied games like Deadlock and Leader pop out of the game assay. This can be used to justify new exploration for pure EGT, like Archetti et al. (2015) using their experiments to reinforce Archetti’s (2013; 2014) theoretical push for nonlinear payoff functions in public good games.
Either way, the short term focus becomes doing simple and clear experiments and describing their procedures and outcomes precisely from within evolutionary game theory. Or taking existing data, and interpreting it through this framework. For example, that is why I focused attention on showing how the experiments of Li et al. (2015) can be represented in an equivalent and just as natural a way with replicator dynamics as with the authors’ choice of Lotka-Volterra equations. If that could not be done, or if the replicator equation representation was significantly more awkward, then that would be evidence against it as a good language.
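For readers less familiar with the replicator-dynamics language, here is a minimal sketch of how a gain function enters the dynamics of a two-strategy game. It is an illustration under assumptions and not the game assay of Kaznatcheev et al. (2019) or the analysis of Li et al. (2015): the payoff matrix is hypothetical, the gain function is assumed linear in the proportion p of the first type, and the equation dp/dt = p(1 - p) gain(p) is integrated with a plain Euler scheme.

```python
def gain(payoff, p):
    """Fitness difference between type A and type B at proportion p of type A."""
    (a, b), (c, d) = payoff
    fitness_a = a * p + b * (1 - p)
    fitness_b = c * p + d * (1 - p)
    return fitness_a - fitness_b

def replicator_trajectory(payoff, p0, dt=0.01, steps=5000):
    """Euler-integrate dp/dt = p (1 - p) gain(p); return the list of proportions."""
    p = p0
    trajectory = [p]
    for _ in range(steps):
        p += dt * p * (1 - p) * gain(payoff, p)
        trajectory.append(p)
    return trajectory

# Hypothetical payoff matrix whose gain function is 1 - 2p,
# so the two types coexist at p = 0.5.
payoff = ((0.0, 3.0), (1.0, 2.0))
final_p = replicator_trajectory(payoff, p0=0.1)[-1]
print(final_p)
```

Here the whole game is summarized by the two coefficients of a linear gain function of proportion, which is the sense in which a dataset can be described as a point in game space.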
This doesn’t mean that I necessarily think EGT-language is better than physics-language or Lotka-Volterra-language. It is just the language that I have chosen to focus on building into a useful descriptive framework, and I hope others do the same with the languages that they are the most proficient in. It would be very exciting, for example, to take the same experiments and design and run them in such a way that we have several different descriptive languages in mind.
In the process, we’ll probably end up with an exciting creole language that can become the natural language for describing experimental cancer biology. We might all currently speak about cancer with our various accents, dialects, and awkwardness. But the next generation will hopefully have overcome our early attempts at description and build the right framework for describing cancer.
Once that is done, we can start focusing on building models in that framework to predict cancer.
At least if we want to have prediction with understanding. Of course, this isn’t always required. Sometimes we might just want prediction without understanding. In fact, areas where such phenomenological theories are successful might be the place where we should first start looking — as long as we’re willing to throw away any ontological baggage of the successful effective theories.
Archetti, M. (2013). Evolutionary game theory of growth factor production: implications for tumour heterogeneity and resistance to therapies. British Journal of Cancer, 109(4): 1056-1062.
Archetti, M. (2014). Evolutionary dynamics of the Warburg effect: glycolysis as a collective action problem among cancer cells. Journal of Theoretical Biology, 341, 1-8 PMID: 24075895.
Archetti, M., Ferraro, D.A., & Christofori, G. (2015). Heterogeneity for IGF-II production maintained by public goods dynamics in neuroendocrine pancreatic cancer. Proceedings of the National Academy of Sciences, 112(6), 1833-8 PMID: 25624490.
Kaznatcheev, A. (2017). Two conceptions of evolutionary games: reductive vs effective. bioRxiv: 231993.
Kaznatcheev, A. (2018). Effective games and the confusion over spatial structure. Proceedings of the National Academy of Sciences: 201719031.
Kaznatcheev, A., Peacock, J., Basanta, D., Marusyk, A., & Scott, J. G. (2019). Fibroblasts and alectinib switch the evolutionary games played by non-small cell lung cancer. Nature Ecology & Evolution, 3(3): 450-456.
Li, X.-Y., Pietschke, C., Fraune, S., Altrock, P.M., Bosch, T.C., & Traulsen, A. (2015). Which games are growing bacterial populations playing? Journal of the Royal Society Interface, 12 (108) PMID: 26236827.
Let me first set up my own teacup, before discussing the more general storm.
Recently, I’ve had a number of chances to present my work on computational complexity as an ultimate constraint on evolution. And some questions have repeated again and again after several of the presentations. I want to address one of these persistent questions in this post.
How common are hard fitness landscapes?
This question has come up during review, presentations, and emails (most recently from Jianzhi Zhang’s reading group). I’ve spent some time addressing it in the paper. But it is not a question with a clear answer. So unsurprisingly, my comments have not been clear. Hence, I want to use this post to add some clarity.
Let’s look at the general storm that Kovaka has identified: debates over frequency. Or debates over how typical certain effects are.
Whenever two or more interesting mechanisms or phenomena that are at odds with each other arise in biology, a debate emerges: how frequent is X? Are most cases Y, with X as just an edge case? Do most systems follow Y? Does X matter more? And other similar questions are asked.
Scientists then aggregate themselves into communities that bifurcate around the answers.
You can see this happening to some extent in polls run by Jeremy Fox on controversial ideas in ecology, and by Andrew Hendry on controversial ideas in evolution. And Kovaka argues that such frequency debates make up the majority of fights in biology.
Sometimes it feels like these questions could be resolved empirically. But from Kovaka’s study, this is seldom the case. Heated frequency debates are not resolved empirically. In fact, they usually just peter out and fade. In some sense, it can feel in hindsight that these controversies don’t matter.
Let’s look at this in the context of my work on computational complexity.
I introduce the notion of easy landscapes, where the computation is not the limiting constraint, and populations can quickly find local fitness peaks. These describe how biologists have mostly thought about static fitness landscapes. As a contrast, I also define hard landscapes where computation is a limiting constraint and thus populations cannot find a local fitness peak in polynomial time. Within a hard landscape, you can easily resolve the frequency debate: no local peak can be found in polynomial time from most starting points, following any evolutionary dynamic (or any polynomial time algorithm more generally).
To establish this, I carry over several techniques from computer science. Most relevant to this case: worst case analysis.
A big difficulty arises in introducing radically different tools from theoretical computer science into biology. They require a thorough defence that I was hoping to delay until after I could attribute some success to the tools. But a careful reviewer 2 noticed this sleight-of-hand and asked me to mount the defence right away. A defence of worst-case analysis. I’ve known since at least 2014 that I’d have to provide compelling arguments. So before the paper was published, I already mounted a partial defence in the text, and more carefully in appendix D.3.
I’d encourage you to read these if you’re interested, dear reader. But I’ll try to discuss the same content in a slightly different way here.
I don’t reason about randomly generated fitness landscapes or the corresponding probability distributions over fitness landscapes. As such, I show the existence of hard fitness landscapes but I cannot reason about the likelihood of such instances. This is not a bug — it’s a feature. I don’t think it makes sense to talk about random fitness landscapes.
As reviewer 2 noted (and as I incorporated into the final text), this is a cultural difference between theoretical computer science and statistical physics. Since statistical physics provides the currently dominant toolset in evolutionary biology, I have an uphill struggle. But I think that cstheory is in the right here.
Fitness landscapes are huge objects. Hyper-astronomical objects. And although we’ve made some measurements of tiny landscapes or local snapshots of more realistic landscapes, it is conceptually impossible to measure a full fitness landscape exhaustively. They are simply too big.
If we can’t measure even one particular object, how is it reasonable to define a distribution that generates these objects? How would we ever test the reasonableness of this generating distribution?
More importantly, fitness landscapes are particular. A zebra is on a particular fitness landscape that exists due to various physical and biological laws. There isn’t some process that has generated random landscapes and some species ended up on some and some on others.
But these are ontological arguments. Let’s make a pragmatic one.
When people discuss classical fitness landscape results, they often talk about logical properties like the degree of epistasis and size of the landscape — but they seldom explicitly discuss (or change) the sampling distribution. They speak as if the assumed generating distribution is not an assumption, but just ignorance. An ignorance that doesn’t bias the conclusion.
But this isn’t the case for such high dimensional objects. In these cases, randomness gives structure. And that structure is highly dependent on the sampling distribution.
In the case of easy vs hard landscapes, I expect hardness results to be extremely sensitive to sampling distributions. I believe this since similar results exist for similar models, although I haven’t proved them yet for the NK-model. In particular, I expect that for samples from uniform distributions (even when properly defined to avoid the simple span arguments we can make against current distributions), hard landscapes will be rare. But if we sample from the inverse Kolmogorov distribution (i.e. landscapes with short descriptions are more likely than landscapes with long descriptions — like Occam’s razor) then my asymptotic hardness results will carry over: hard landscapes will be common.
Yet both the uniform distribution and Occam’s razor can be defended as reasonable choices of distribution. During a recent conversation, Colin Twomey even made a compelling argument for the direct opposite of Occam’s razor: that ‘typical’ fitness landscapes are the incompressible ones. So what should we make of this sensitivity of the model? We can’t use this to answer how frequent hard fitness landscapes are, but maybe that wasn’t the right question.
In the more general context, Kovaka gives us a way forward. She points out that the talk of ‘frequency’ is actually not important in general biological fights over frequency. This wording is just a frame and if taken literally, it seldom contributes to actual advancement of the field. What matters instead is that in the process of arguing about how typical or rare particular phenomena or mechanisms are, biologists learn the logical limits and boundaries of these phenomena much better. It is these logical characterizations and interconnections that form the lasting contribution of frequency debates. It is these logical characterizations that showcase the causal patterns and regularities of phenomena. It is these logical characterizations that explain how a particular pattern will be established in future cases. In the end, it turns out that we do not actually want to know the frequency but instead the particular conditions for a pattern.
So what does this mean in the context of fitness landscapes? It means that we should do the next step that cstheory points us to: parametrized complexity. Figure out what logical features of (the descriptions of) fitness landscapes can guarantee easy ones and which can’t.
I am currently working on this. And I hope it ends up with more helpful questions than “how common are hard landscapes?”
In their simulation, /u/abraxasknister has a fixed center (black dot) that the first mass (red dot) is attached to (by an invisible rigid massless bar). The second mass (blue dot) is then attached to the first mass (also by an invisible rigid massless bar). They then release these two masses from rest at some initial height and watch what happens.
The resulting dynamics are at right.
It is certainly unpredictable and complicated. Chaotic? Most importantly, it is obviously wrong.
But because the double pendulum is a famous chaotic system, some people did not want to acknowledge that there is an obvious mistake. They wanted to hide behind chaos: they claimed that for a complex system, we cannot possibly have intuitions about how the system should behave.
In this post, I want to discuss the error of hiding behind chaos, and how the distinction between microdynamics and global properties lets us catch /u/abraxasknister’s mistake.
A number of people on Reddit noticed the error with /u/abraxasknister’s simulation right away. But the interesting part for me was how other people then jumped in to argue that the correctors could not possibly know what they were talking about.
For example, /u/Rickietee10 wrote:
It’s based on [chaos] theory. … Saying it doesn’t [look] right isn’t even something you can say, because it’s completely random.
Or /u/chiweweman’s dismissing a correct diagnosis of the mistake with:
That’s possible, but also double pendulums involve chaos theory. It’s likely this is just a frictionless simulation.
These detractors were trying to hide behind complexity. They thought that unpredictable microdynamics meant that nothing about the system is knowable. Of course, they were wrong.
But their error is an interesting one. This seems like an unfortunately common misuse of chaos in some corners of complexology. We say that some system (say the economy) is complex. Thus it is unknowable. Thus, people offering linear theories (say economists) cannot possibly know what they are talking about. They cannot possibly be right.
Have you encountered variants of this argument, dear reader?
This kind of argument is wrong. And in the case of the double pendulum, /u/GreatBigBagOfNope responded best:
You can’t just slap the word chaos on something and expect the conservation of energy to no longer apply
So let us use the conservation of energy to explain why the simulation is wrong.
From the initial conditions, we can get an estimate of the system’s energy. This is particularly easy in this case since the masses start at rest at some height — thus all energy is potential energy. From this — due to the time-invariance of the Hamiltonian specifying the double pendulum — we know by Noether’s theorem that this initial energy will be conserved. In this particular case, this means that we cannot ever have both of the masses above their initial position at the same time. If that happened then (just) the potential energy of this configuration would be strictly higher than the total initial energy. Since we see both of the masses simultaneously above their initial position in the gif, we can conclude that there is an error in /u/abraxasknister’s simulation.
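The energy argument can be turned into a frame-by-frame check. Here is a minimal sketch, not /u/abraxasknister’s code: it assumes unit masses and unit bar lengths, angles measured from the downward vertical, and the pivot at height zero. The names `potential_energy` and `violates_energy_bound` are made up for illustration.

```python
import math

def potential_energy(theta1, theta2, m1=1.0, m2=1.0, l1=1.0, l2=1.0, g=9.81):
    """Potential energy of a double pendulum; angles are measured from the
    downward vertical and the fixed pivot sits at height zero."""
    y1 = -l1 * math.cos(theta1)       # height of the first mass
    y2 = y1 - l2 * math.cos(theta2)   # height of the second mass
    return m1 * g * y1 + m2 * g * y2

def violates_energy_bound(initial, frame, tol=1e-9):
    """True if `frame` has more potential energy than the total energy of a
    system released from rest at `initial` (kinetic energy is never negative)."""
    return potential_energy(*frame) > potential_energy(*initial) + tol

start = (math.pi / 2, math.pi / 2)    # both bars horizontal, released from rest
print(violates_energy_bound(start, (0.3, -0.2)))                       # both masses below start
print(violates_energy_bound(start, (0.75 * math.pi, 0.75 * math.pi)))  # both masses above start
```

Any frame of a simulation that makes this check return `True` is physically impossible for a frictionless, undriven pendulum, no matter how chaotic the trajectory in between.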
I enjoy this kind of use of global abstract argument to reason without knowing the details of microdynamics. For me, this is the heart of theoretical computer science.
Based on this violation of energy conservation, many theories were discussed for what the error in the simulation might have been. And the possibility of energy-pumping from finite step size was a particularly exciting candidate. A ‘bug’ (that can be a ‘feature’) that I’ll discuss another day in the context of replicator dynamics.
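To see what energy pumping from a finite step size looks like, here is a minimal sketch. It is not the simulation from the Reddit thread: it assumes a unit-mass, unit-frequency harmonic oscillator as a simple stand-in, integrated with the explicit Euler method, and the name `euler_energy` is hypothetical.

```python
def euler_energy(steps, dt, x=1.0, v=0.0):
    """Integrate x'' = -x with explicit Euler and return the final energy."""
    for _ in range(steps):
        x, v = x + dt * v, v - dt * x   # both updates use the old state
    return 0.5 * (x * x + v * v)

initial_energy = 0.5                     # energy of the state x=1, v=0
for dt in (0.1, 0.01, 0.001):
    steps = round(10 / dt)               # integrate to the same final time
    print(dt, euler_energy(steps, dt) / initial_energy)
```

For this particular oscillator and integrator, each step multiplies the energy by exactly 1 + dt², so the numerical energy grows even though the true dynamics conserve it. Shrinking the step size slows the growth but does not remove it.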
The actual main mistake turned out to be much less exciting: a typo in the code. A psy instead of phi in one equation.
I’m sure that all of us that have coded simulations can relate. If only we always had something as nice as the conservation of energy to help us debug.
Of course, this doesn’t mean that there aren’t interesting discussions to be had on chaos and prediction. I’ve written before on computer science on prediction and the edge of chaos and on how stochasticity, chaos and computations can limit prediction. But we shouldn’t use chaos or complexity to stop ourselves from asking questions or making predictions. Instead, we should use apparent complexity as motivation to find the limits of more linear theories. And whatever system we work with, we should look for overarching global principles like the conservation of energy that we can use to abstract over the chaotic microdynamics.
So for this week, I want to change things up a bit. I want to discuss some of the math behind a success of cstheory applied to nature: quantum computing. It’s been six years since I blogged about quantum query complexity and the negative adversary method for lower bounding it. And it has been close to 8 years since I’ve worked on the topic.
But I did promise to write about span programs — a technique used to reason about query complexity. So in this post, I want to shift gears to quantum computing and discuss span programs. I doubt this is useful to thinking about evolution, but it never hurts to discuss a cool linear-algebraic representation of functions.
I started writing this post for the CSTheory Community Blog. Unfortunately, that blog is largely defunct. So, after 6 years, I decided to post on TheEGG instead.
Please humour me, dear reader.
Span programs are a linear-algebraic representation of functions. They originated in the work of Karchmer and Wigderson [KW93] on classical complexity, and were introduced into quantum computing by Reichardt and Spalek [RS08] in the context of formula evaluation. A span program $P$ consists of a target $|t\rangle$ in a vector space $V$, and a collection of subspaces $V_{j,b} \subseteq V$ for $j \in [n]$ and $b \in \{0,1\}$. For an input $x \in \{0,1\}^n$, if $f_P(x) = 1$ then the target vector can be expressed as a linear combination of vectors in $\bigcup_{j \in [n]} V_{j,x_j}$. For the classical complexity measures on span programs (size) the particular choice of bases for $V_{j,b}$ does not matter, but for the quantum witness size it is important to fix the set of “input vectors” that span each subspace.
Formally:
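A span program $P$ consists of a “target” vector $|t\rangle$ in a finite-dimensional inner-product space $V$ over $\mathbb{C}$, together with “input” vectors $|v_i\rangle \in V$ for $i \in I$. Here the index set $I$ is a disjoint union:

$$I = I_{\text{free}} \cup \bigcup_{j \in [n],\, b \in \{0,1\}} I_{j,b}$$

$P$ corresponds to a function $f_P : \{0,1\}^n \to \{0,1\}$, defined by:

$$f_P(x) = \begin{cases} 1 & \text{if } |t\rangle \in \mathrm{span}\{ |v_i\rangle : i \in I_{\text{free}} \cup \bigcup_{j \in [n]} I_{j,x_j} \} \\ 0 & \text{otherwise} \end{cases}$$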
A span program is called strict if $I_{\text{free}} = \emptyset$. In general, we can assume span programs are strict; a non-empty $I_{\text{free}}$ is only useful for optimizing some algorithmic considerations. In the classical literature only strict span programs were considered [KW93,Gal01,GP03]. In fact, the classical literature considers even more restrictive programs such as monotone span programs [Gal01,GP03]. A span program is monotone if for all $j \in [n]$ we have $I_{j,0} = \emptyset$. For every monotone function there exists a monotone span program representing it and vice-versa. These programs also correspond to linear secret-sharing schemes, but as of 2011, were not yet studied from the quantum interpretation (have they since, dear reader?). Unlike monotone circuits, monotone span programs are believed to be less restrictive with respect to the natural classical notion of span program complexity.
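The function that a span program computes can be checked numerically: the output is 1 exactly when the target lies in the span of the input vectors that the input bits make available. Here is a minimal sketch under some assumptions of mine: real rather than complex vectors, a dictionary keyed by `(j, b)` pairs (or `"free"`) to hold the input vectors, and a least-squares residual as the span test. The name `span_program_value` and the small AND example are illustrative, not taken from the references.

```python
import numpy as np

def span_program_value(target, vectors, x, tol=1e-9):
    """Return 1 if `target` lies in the span of the available input vectors,
    else 0. `vectors` maps (j, b) or "free" to a list of vectors."""
    available = list(vectors.get("free", []))
    for j, bit in enumerate(x):
        available += vectors.get((j, bit), [])
    if not available:
        return int(np.linalg.norm(target) <= tol)
    A = np.array(available, dtype=float).T           # columns are input vectors
    coeffs = np.linalg.lstsq(A, target, rcond=None)[0]
    return int(np.linalg.norm(A @ coeffs - target) <= tol)

# A strict, monotone span program for AND on two bits (an illustrative choice).
target = np.array([1.0, 1.0])
and_vectors = {(0, 1): [np.array([1.0, 0.0])],
               (1, 1): [np.array([0.0, 1.0])]}

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, span_program_value(target, and_vectors, x))
```

The example is strict because it has no `"free"` vectors and monotone because no vector is attached to a bit being 0. Its size, in the classical sense, is 2.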
The classical notion of complexity for span programs is called size. The size of a span program is the number of input vectors $|I|$, and the size of a function is then the minimum size over all span programs that represent the function [KW93]. For the correspondence between span programs and quantum query complexity, however, we have to consider a different measure of complexity known as witness size [RS08].
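Consider a span program $P$. Let $A = \sum_{i \in I} |v_i\rangle\langle i|$. For each input $x \in \{0,1\}^n$, let $I(x) = I_{\text{free}} \cup \bigcup_{j \in [n]} I_{j,x_j}$ and $\Pi(x) = \sum_{i \in I(x)} |i\rangle\langle i|$, and

$$\mathrm{wsize}(P,x) = \begin{cases} \min \{ \| |w\rangle \|^2 : A\Pi(x)|w\rangle = |t\rangle \} & \text{if } f_P(x) = 1 \\ \min \{ \| A^\dagger |w'\rangle \|^2 : \langle t | w' \rangle = 1,\ \Pi(x) A^\dagger |w'\rangle = 0 \} & \text{if } f_P(x) = 0 \end{cases}$$

Let the witness size of $P$ be $\mathrm{wsize}(P) = \max_{x \in \{0,1\}^n} \mathrm{wsize}(P,x)$.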
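For inputs that the span program accepts, the witness size is a minimum-norm problem: find the coefficients of smallest squared norm that combine the available input vectors into the target. Here is a minimal sketch of just that case, assuming real vectors and using the pseudoinverse to get the minimum-norm solution. The name `positive_witness_size` is hypothetical, and the rejecting case (which needs a different optimization) is not covered.

```python
import numpy as np

def positive_witness_size(target, available):
    """Smallest squared norm of coefficients w with sum_i w_i v_i = target,
    or None if the target is outside the span of `available`."""
    A = np.array(available, dtype=float).T         # columns are input vectors
    w = np.linalg.pinv(A) @ target                 # minimum-norm least-squares solution
    if not np.allclose(A @ w, target):
        return None
    return float(w @ w)

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
t = np.array([1.0, 1.0])

print(positive_witness_size(t, [e1, e2]))          # AND-style program on input 11
print(positive_witness_size(3 * t, [e1, e2]))      # same program, target scaled by 3
print(positive_witness_size(t, [e1]))              # target not reachable
```

Scaling the target by 3 multiplies this witness size by 9, which shows how sensitive the measure is to the normalization of the target.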
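Note that there is a certain imprecision in how we specified a span program. In particular, if we replace the target vector $|t\rangle$ by $c|t\rangle$ ($c \neq 0$), then we change the witness size by a factor of $|c|^2$ if $f_P(x) = 1$ or $1/|c|^2$ if $f_P(x) = 0$. Thus we might as well have defined the witness size as:

$$\sqrt{\max_{x : f_P(x) = 1} \mathrm{wsize}(P,x) \cdot \max_{x : f_P(x) = 0} \mathrm{wsize}(P,x)}$$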
However, we will see this is unnecessary since we can transform any span program into a canonical span program:
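A span program $P$ is canonical if $V = \mathbb{C}^{F_0}$ where $F_0 = \{x : f_P(x) = 0\}$, the target vector is $|t\rangle = \sum_{x \in F_0} |x\rangle$, and for all $x \in F_0$ and $i \in I(x)$, $\langle x | v_i \rangle = 0$.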
Using classical techniques [KW93] we can show that this does not increase our complexity measures:
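A span program $P$ can be converted to a canonical span program $\hat{P}$ that computes the same function, with $\mathrm{size}(\hat{P}) \leq \mathrm{size}(P)$ and $\mathrm{wsize}(\hat{P}) \leq \mathrm{wsize}(P)$. For all $x$ with $f_P(x) = 0$, $|x\rangle$ itself is an optimal witness.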
This simplifies the definition of witness size, and we can write down an optimization problem to solve for the smallest witness size of a function, as:
Notice the similarity of the above equation and the dual of the adversary method. The similarity is no coincidence: the former is the dual of the negative adversary method:
For a proof, I direct the interested reader to Ref.[Rei09].
[Gal01] Anna Gal. A characterization of span program size and improved lower bounds for monotone span programs. Computational Complexity, 10:277-296, 2001.
[GP03] Anna Gal and Pavel Pudlak. A note on monotone complexity and the rank of matrices. Information Processing Letters, 87:321-326, 2003.
[KW93] Mauricio Karchmer and Avi Wigderson. On span programs. In Proc. of 8th IEEE Symp. Structure of Complexity Theory, pages 102-111, 1993.
[Rei09] Ben W. Reichardt. Span programs and quantum query complexity: The general adversary bound is nearly tight for every boolean function. In 2009 50th Annual IEEE Symposium on Foundations of Computer Science, pages 544-551. IEEE, 2009, arXiv:0904.2759v1.
[RS08] Ben W. Reichardt and Robert Spalek. Span-program based quantum algorithm for evaluating formulas. In Proc. 40th ACM STOC, pages 103-112, 2008, arXiv:0710.2630.
Last week, I tried to continue this thought for Oxford students at a joint meeting of the Computational Society and Biological Society. On May 22, I gave a talk on algorithmic biology. I want to use this post to share my (shortened) slides as a pdf file and give a brief overview of the talk.
If you didn’t get a chance to attend, maybe the title and abstract will get you reading further:
Algorithmic Biology: Evolution is an algorithm; let us analyze it like one.
Evolutionary biology and theoretical computer science are fundamentally interconnected. In the work of Charles Darwin and Alfred Russel Wallace, we can see the emergence of concepts that theoretical computer scientists would later hold as central to their discipline. Ideas like asymptotic analysis, the role of algorithms in nature, distributed computation, and analogy from man-made to natural control processes. By recognizing evolution as an algorithm, we can continue to apply the mathematical tools of computer science to solve biological puzzles – to build an algorithmic biology.
One of these puzzles is open-ended evolution: why do populations continue to adapt instead of getting stuck at local fitness optima? Or alternatively: what constraint prevents evolution from finding a local fitness peak? Many solutions have been proposed to this puzzle, with most being proximal – i.e. depending on the details of the particular population structure. But computational complexity provides an ultimate constraint on evolution. I will discuss this constraint, and the positive aspects of the resultant perpetual maladaptive disequilibrium. In particular, I will explain how we can use this to understand both on-going long-term evolution experiments in bacteria; and the evolution of costly learning and cooperation in populations of complex organisms like humans.
Unsurprisingly, I’ve written about all these topics already on TheEGG, and so my overview of the talk will involve a lot of links back to previous posts. In this way, this can serve as an analytic linkdex on algorithmic biology.
One of my students, Joe Gardner, invited me to give this talk. Together with Ben Slater, he was excited about my recent paper on computational complexity as an ultimate constraint on evolution. They thought that other students would also be interested, and that this could be a good way to bring the Computational Society and Biological Society together for a joint event.
I was more than happy to participate.
But given the technical nature of some of my work, I wanted to focus on a broad overview of algorithmic biology and evolution. And only briefly touch on my specific work at the end. The talk ended up having four main parts:
In terms of the actual content, the talk followed the ideas sketched in the following posts from TheEGG:
Quick introduction: the algorithmic lens (March 29, 2019)
Most people are familiar with a particular form of interaction between computer science and biology — what can be broadly classified as computational biology. This includes areas like bioinformatics, agent-based modeling of evolution, and maybe even extends to topics like genetic programming and genetic algorithms. But this isn’t the only aspect of the boundary between computer science and biology. This is just practical skills from computer science applied to the outputs of biology. A complementary approach is the use of mathematical techniques from computer science to build the conceptual grounding of biology. This was the aspect that I wanted to focus the talk on.
For this, I needed to give some history of evolution:
British agricultural revolution gave us evolution by natural selection (May 25, 2019)
If we look at the origins of evolution by natural selection, we can find a strong motivating factor from technology. One of the technological achievements of Darwin’s time was rapid improvements in animal husbandry and agriculture. In the above post, I make the case that this technology was an essential part of the inspiration for Darwin’s foundational insights. Darwin looked at selection implemented by Robert Bakewell and asked if instead it can be implemented by a non-human agent — an abstract agent like the struggle for existence.
Darwin as an early algorithmic biologist (August 4th, 2018)
In doing this, Darwin was being an early algorithmic biologist. He was recognizing the importance of multiple-realizability and the fact that algorithms can be implemented in a distributed manner. In viewing the struggle for existence as the implementing agent for natural selection, Darwin was also using asymptotic analysis: seeing the qualitative difference in growth rate between the constant or polynomially increasing abundance of resources versus the exponential growth of populations.
Fitness landscapes as mental and mathematical models of evolution (August 16th, 2013)
To make this process of evolution easier to think about and model, in the century after Darwin, Wright developed the fitness landscape metaphor. The discrete aspect of the metaphor is especially useful if we want to incorporate the discrete elements of Mendelian genetics — something that was foreign to Darwin, but is an essential part of our current biological thought. But fitness landscapes also raise an issue: why aren’t all species just stuck at local fitness peaks? Why do we see any evolution happening at all?
Darwin and Wallace were geologists, so they overcame the local peak problem by having the environment constantly change. For them the world is constantly changing at the geological level and thus geological change gets reflected in the biological world. This is almost certainly a great explanation for a naturalist, but an experimentalist can throw a wrench in this reasoning. Richard Lenski has done this by evolving E. coli for (now) 70,000 generations in a static environment. They still haven’t found a fitness peak and continue to increase in fitness. Lenski and colleagues see open-ended evolution.
This is part of what I aim to explain.
From perpetual motion machines to the Entscheidungsproblem (March 9th, 2019)
But before explaining open-ended evolution, it is important to understand the ideas of multiple realizability and abstraction. This can be illustrated by turning to the other big new technology of Darwin’s time: steam engines.
If today someone came to you with plans for a perpetual motion machine, you would know that those plans are wrong without even examining them. This is due to the foundational role that thermodynamics has achieved. We don’t need to know the details of the proposed machine to know that it would have to violate the laws of physics as we know them to achieve its result.
Computational complexity can achieve a similar foundational reach. Just like we don’t need to think about thermodynamics as about steam engines, we don’t need to see computational complexity as about computers. Any system can be viewed and analyzed as an algorithm — including evolution. And thus computational constraints can be applied to evolution.
Most importantly, this means that we should analyze evolution as an algorithm.
And if we can’t analyze the algorithm then we shouldn’t attribute random properties to it that ‘feel right’, seem obvious in the one-step case, or that we want to be true. That’s not good practice when we’re programming and it’s not good practice when we’re doing science, either.
In particular, this means that we should shift some of our metaphors for fitness landscapes from mountain ranges to mazes.
Proximal vs ultimate constraints on evolution (July 24th, 2018)
In evolutionary biology, constraints are what prevent evolution from finding peaks (or exits in the maze metaphor) in fitness landscapes. These constraints can be caused by various factors. By analogy to the distinction between algorithms vs problems, I divide the constraints into two types: proximal vs ultimate.
Most of the constraints on evolution familiar to biologists are proximal. They are due to features of the particular population, like population or developmental structure, trait co-variants, standing variation, etc. In contrast, computational complexity is an ultimate constraint: it applies to any population on a given family of landscapes.
From this background, I could describe my recent results:
Computational Complexity as an Ultimate Constraint on Evolution
This involves thinking about the different kinds of epistasis and corresponding landscapes from smooth to semismooth to rugged. How we can get progressively harder landscapes if we have more freedom on the kind of epistasis that occurs. And how this resultant perpetual maladaptive disequilibrium can be transformed into positive results like solutions to the Baldwin effect for costly learning, or making permanent cooperation due to the Hankshaw effect.
For now, I think that the best summary of these results is still my 25 tweet thread on this paper. But maybe I should consider writing a more linkable post in the future.
Today we might discuss artificial selection or domestication (or even evolutionary oncology) as applying the principles of natural selection to achieve human goals. This is only because we now take Darwin’s work as given. At the time that he was writing, however, Darwin actually had to make his argument in the other direction. Darwin’s argument proceeds from looking at the selection algorithms used by humans and then abstracting them to focus only on the algorithm and not the agent carrying out the algorithm. Having made this abstraction, he can implement the breeder by the distributed struggle for existence and thus get natural selection.
The inspiration is clearly from the technological to the theoretical. But there is a problem with my story.
Domestication of plants and animals is ancient. Old enough that we have cancers that arose in our domesticated helpers 11,000 years ago and persist to this day. Domestication in general — the fruit of the first agricultural revolution — can hardly qualify as a new technology in Darwin’s day. It would have been just as known to Aristotle, and yet he thought species were eternal.
Why wasn’t Aristotle or any other ancient philosopher inspired by the agriculture and animal husbandry of their day to arrive at the same theory as Darwin?
The ancients didn’t arrive at the same view because it wasn’t the domestication of the first agricultural revolution that inspired Darwin. It was something much more contemporary to him. Darwin was inspired by the British agricultural revolution of the 18th and early 19th century.
In this post, I want to sketch this connection between the technological development of the Georgian era and the theoretical breakthroughs in natural science in the subsequent Victorian era. As before, I’ll focus on evolution and algorithm.
What was the British agricultural revolution? Was it even seen as a revolution in its own time or recognized only in hindsight? For this, we can turn to Victorian texts. We can see a description of the revolution directly in Darwin’s writings. For example, in his October 1857 letter to Asa Gray, Darwin writes: “[s]election has been methodically followed in Europe for only the last half century” (the emphasis is original).
‘Methodical’ is key. And the innovation that Darwin is alluding to originated in Leicestershire with Robert Bakewell and was popularized by Thomas Coke, 1st Earl of Leicester.
Bakewell built a mechanistic approach to agriculture. Not the replacement of farm workers by machines, but the methodical scientific approach to agriculture. The introduction of methodical inductive empiricism.
In particular, Bakewell is most remembered for the Dishley system — known today as line breeding or breeding in-and-in. Prior to Bakewell, livestock of both sexes was kept together in fields. This resulted in natural assortment between the livestock and did not easily produce systematic traits in the offspring — to the casual onlooker, the traits in the offspring of these populations would be diverse and seemingly random. Bakewell separated the sexes and only allowed deliberate, specific mating. This allowed him to more easily and rapidly select for desired traits in his livestock.
During the 35 years from Robert Bakewell inheriting his father’s farm in 1760 to his own death in 1795, he developed several new breeds of livestock including new kinds of sheep, cattle, and horses.
It was apparent to any observer that these were different variations on species. For example, they produced more wool, gained more weight more quickly, and were easier to work with than prior livestock. During Bakewell’s lifetime, the average weight of bulls at auction is reported to have doubled.
The Dishley system — i.e. Bakewell’s algorithm — clearly produced new varieties. The puzzle was now an algorithmic one: was a human breeder required to implement this algorithm, or was this always taking place even without human intervention?
Thus, in recognizing (artificial) selection, Darwin was not extracting an implicit algorithm from a long-held human practice. Rather, he was taking an explicit algorithm advocated and practiced by his contemporaries. In On the Origin of Species, Darwin explicitly acknowledges Bakewell’s demonstration of variation under domestication, and even discusses the branching of Bakewell’s variations under the breeding of different farmers (Buckley vs. Burgess).
Darwin’s contribution to Bakewell’s algorithm was abstracting it: recognizing that the agent that implements the algorithm is irrelevant. We don’t need to have Robert Bakewell or another agriculturalist do the selecting. Instead, we can have a distributed agent like the struggle for existence. It is this algorithmic abstraction that allowed Darwin to revolutionize how we think about nature. But it was the latest technology of his day that led him there. Darwin took a human algorithm and asked if it can also explain nature.
Bakewell’s contribution to the technology of agriculture and influence on the future of evolutionary theory extends beyond breeding. He also established experimental plots on his farm to test different manure and irrigation methods. This practice was part of the inspiration for John Bennet Lawes’ establishment of the Rothamsted Experimental Station in 1843 for carrying out long-term agricultural experiments. Their 1856 Park Grass Experiment is still ongoing. But the station is perhaps best known for its theoretical contribution to evolutionary biology during the 14-year tenure (1919–1933) of Ronald Fisher. While at Rothamsted, Fisher developed the statistics and population genetics of the modern evolutionary synthesis to make sense of the data from these ongoing evolutionary experiments.
And the inspiration on evolution from technology was not limited to agriculture. Steam engines — the other new tech of the day, and one whose study I’ve already compared to computer science on TheEGG — also make an appearance in the first publication of natural selection in 1858. In his section, Alfred Russel Wallace writes that the “action of [natural selection] is exactly like that of the centrifugal governor of the steam engine, which checks and corrects any irregularities almost before they become evident.” An analogy to another recent technology; this one introduced into common usage by James Watt in 1788.
It is easy to imagine history as going from idea to technology. But I think this is often an anachronism. Rather, technology can lead foundational ideas. The tools we build to understand and develop technology can often form the basis for the abstractions and analogies that create new theoretical, philosophical, and scientific ideas.
Today, we should look at the computer — the big technology from the last 50 years — not as just a practical tool. We need to recognize the principles and mathematics underlying algorithms and computation not as just technological aids but as means to fundamentally transform our understanding of nature. And as with Darwin and Wallace, I propose that we focus that transformation on our understanding of biology. Maybe the computing revolution will give us algorithmic biology as the next development in our understanding of the world around us.
A couple of years ago, Robert had a computer science question. One at the data analysis and visualization stage of the relationship between computer science and cancer. Given that I haven’t posted code on TheEGG in a long time, I thought I’d share some visualizations I wrote to address Robert’s question.
There are many ways to measure the size of populations in biology. Given that we use it in our game assay, I’ve written a lot about using time-lapse microscopy of evolving populations. But this isn’t the only — or most popular — approach. It is much more common to dilute populations heavily and then count colony forming units (CFUs). I’ve discussed this briefly in the context of measuring stag-hunting bacteria.
But you can also combine both approaches. And do time-lapse microscopy of the colonies as they form.
A couple of years ago, Robert Vander Velde and Andriy Marusyk were working on experiments that use colony forming units (CFUs) as a measure of populations. However, they wanted to dig deeper into the heterogeneous dynamics of CFUs by tracking the formation process through time-lapse microscopy. Robert asked me if I could help out with a bit of the computer vision, so I wrote a Python script for them to identify and track individual colonies through time. I thought that the code might be useful to others (or to me in the future), so I wanted to write a quick post explaining my approach.
This post ended up trapped in the drafts box of TheEGG for a while, but I thought now is as good a time as any to share it. I don’t know where Robert’s work on this has gone since, or if the space-time visualizations I developed were of any use. Maybe he can fill us in in the comments or with a new guest post.
So let’s just get started with the code.
Of course, we first need to import the main packages: numpy, pyplot, and cv2.
import numpy as np
import cv2
from matplotlib import pyplot as plt
The first two are standard packages; the last one, OpenCV, takes a little more work to install.
Now we do two main tasks at once: we load all the images and create something I want to call a ‘space-time map’. A space-time map is an image that uses the colour of each pixel to represent the number of time points at which that pixel appears. This is the first name that occurred to me; if you’ve seen this visualisation used before, dear reader, and know its name, then please let me know.
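To make the definition concrete, here is a minimal sketch on synthetic data. The three tiny binary frames are hypothetical stand-ins for the microscopy images, and the plain average used here is only one way to build such a map.

```python
import numpy as np

# Three hypothetical 4x4 binary frames (1 = colony pixel, 0 = background).
frames = np.zeros((3, 4, 4), dtype=np.uint8)
frames[0, 1, 1] = 1        # a colony appears at (1, 1) in frame 0
frames[1, 1, 1:3] = 1      # it spreads to (1, 2) in frame 1
frames[2, 1:3, 1:3] = 1    # and fills a 2x2 block in frame 2

# Space-time map: fraction of time points in which each pixel is occupied.
space_time = frames.mean(axis=0)

print(space_time)
```

Pixels that were occupied early end up with the highest values, so the oldest part of a colony shows up as the most saturated region of the map. The loop over the real image files does the same kind of accumulation, one frame at a time.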
num_imgs = 24

# load the images and create the space-time image (total_img)
img_all = []
total_img = np.zeros_like(cv2.imread(str(0).zfill(4) + '.tif', cv2.IMREAD_GRAYSCALE))
for img_num in range(num_imgs):
    f_name = str(img_num).zfill(4) + '.tif'
    # read the frame as greyscale and invert it
    img = cv2.bitwise_not(cv2.imread(f_name, cv2.IMREAD_GRAYSCALE))
    img_all.append(img)
    # add this frame's share (1/num_imgs) to the running total
    total_img = cv2.scaleAdd(img, 1.0 / num_imgs, total_img)

plt.imshow(total_img, cmap="magma_r")
plt.show()
This results in an image like:
From this image, we can get persistent numbers for all the colonies that existed:
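# get the colonies
# note: img here still holds the last frame loaded in the loop above
_, total_thresh = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
# label each connected blob of foreground pixels with its own integer
_, total_colonies = cv2.connectedComponents(total_thresh)
num_colonies = np.amax(total_colonies)
print("There are " + str(num_colonies) + " colonies")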
More importantly, the image total_colonies now has each non-background pixel labeled by its colony number, so counting the number of pixels in each colony at each time point becomes as straightforward as applying a mask:
# use the total image (properly thresholded) as the permanent numbers for the colonies;
# get the colony numbers in each frame from them
colony_sizes = np.zeros((num_colonies + 1, num_imgs), dtype=int)
# note that colony_sizes[0, :] will contain the amount of empty space

for img_num, img in enumerate(img_all):
    # label colonies by their numbers (for up to 255 colonies)
    labeled_img = np.minimum(total_colonies, img)

    # get the colonies that appear and their sizes
    colonies, sizes = np.unique(labeled_img, return_counts=True)
    colony_sizes[colonies, img_num] = sizes

# plt.imshow(total_colonies, cmap='magma_r')
for colony in range(1, num_colonies + 1):
    plt.plot(colony_sizes[colony, :])
plt.yscale('log')
plt.show()
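The masking step is easy to check by hand on a small example. The sketch below uses a hypothetical 4x4 label image and a hypothetical binary frame (both made up for illustration) to show why the element-wise minimum acts as a mask when frame pixels are either 0 or 255.

```python
import numpy as np

# Hypothetical label image: 0 = background, 1 and 2 = two colonies.
labels = np.array([[0, 1, 1, 0],
                   [0, 1, 1, 0],
                   [0, 0, 0, 2],
                   [0, 0, 2, 2]], dtype=np.int32)

# Hypothetical binary frame (255 = colony pixel present at this time point).
frame = np.array([[0, 255,   0,   0],
                  [0, 255,   0,   0],
                  [0,   0,   0,   0],
                  [0,   0, 255, 255]], dtype=np.uint8)

# The minimum keeps the label where the frame is 255 and gives 0 elsewhere,
# as long as every label is at most 255.
labeled_frame = np.minimum(labels, frame)

# Count the pixels carrying each label in this frame.
ids, counts = np.unique(labeled_frame, return_counts=True)
sizes = np.zeros(labels.max() + 1, dtype=int)
sizes[ids] = counts
print(sizes)
```

Entry 0 of the result counts everything that is not a colony pixel in this frame, including pixels that belong to a colony at some other time point. With more than 255 colonies, the minimum would clip the higher labels, so a boolean mask would be a safer choice in that case.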
Unfortunately, there are a number of colonies that ‘blink’ in and out of existence. This is not a manifestation of reality, but probably an artefact of the image processing software used to produce the initial threshold images and of the sensitivity of the microscope. As such, it can be helpful to clean up the time series, focus only on the colonies that didn’t go extinct during the experiment, and look at their population dynamics.
# let's clean up by eliminating colonies that go extinct at some point
colony_lifetimes = np.sum(colony_sizes > 0, axis=1)
# [1:] drops the first match, which is index 0 (the background)
# whenever empty space is present in every frame
surviving_colonies = np.where(colony_lifetimes == num_imgs)[0][1:]
print(surviving_colonies)

for colony in surviving_colonies:
    plt.plot(colony_sizes[colony, :])
plt.yscale('log')
plt.show()
But the figure that this produces is still difficult to make sense of. I won’t even bother to show it, given how many different lines there are, going in different directions for growth rates.
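The extinction filter itself is easier to check on a toy array than on the plot. The sketch below uses a hypothetical size table (four labels, four time points) and selects the survivors with an explicit test on the label rather than a slice.

```python
import numpy as np

# Hypothetical sizes: rows are labels (row 0 = background), columns are time points.
colony_sizes = np.array([[90, 85, 80, 70],   # background, present in every frame
                         [ 5, 10, 12, 20],   # colony 1: always present
                         [ 5,  0,  8, 10],   # colony 2: 'blinks' out at t = 1
                         [ 0,  5,  0,  0]])  # colony 3: seen only once
num_imgs = colony_sizes.shape[1]

# Number of time points at which each label has at least one pixel.
lifetimes = np.sum(colony_sizes > 0, axis=1)

# Keep labels seen at every time point, skipping the background label 0 explicitly.
labels = np.arange(colony_sizes.shape[0])
surviving = labels[(lifetimes == num_imgs) & (labels > 0)]
print(surviving)
```

Testing for labels greater than zero is a small design choice: it does not rely on the background appearing in every frame, which the `[1:]` slice above assumes.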
What we really care about are higher-level properties of these colonies, like their growth rate, so let’s infer those with the help of scipy:
# among those that don't go extinct, let's calculate the growth rates
from scipy.stats import mstats

growth_rates = np.zeros(num_colonies + 1)
growth_rates_low = np.zeros(num_colonies + 1)
growth_rates_high = np.zeros(num_colonies + 1)

for colony in surviving_colonies:
    # Theil-Sen slope of log(size) against frame number, with its lower and upper bounds
    growth_rates[colony], _, growth_rates_low[colony], growth_rates_high[colony] = \
        mstats.theilslopes(np.log(colony_sizes[colony, :]))

plt.errorbar(np.arange(num_colonies + 1), growth_rates,
             yerr=[growth_rates - growth_rates_low, growth_rates_high - growth_rates],
             fmt='ko')
plt.xlabel('Colony number')
plt.ylabel('Growth rate')
plt.show()
This yields a colony growth rate plot that is easier to look at, with 95% confidence intervals.
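For readers who have not met the Theil-Sen estimator behind `theilslopes`, the point estimate is the median of the slopes between all pairs of points. Here is a minimal numpy-only sketch of that idea; the helper `theil_sen_slope` and the colony sizes are hypothetical, and the sketch leaves out the confidence bounds that scipy also returns.

```python
import numpy as np
from itertools import combinations

def theil_sen_slope(y):
    """Median of the slopes between all pairs of points, with x = 0, 1, 2, ..."""
    slopes = [(y[j] - y[i]) / (j - i)
              for i, j in combinations(range(len(y)), 2)]
    return float(np.median(slopes))

# Hypothetical colony that doubles in area every frame, with one bad frame (index 3).
sizes = np.array([10, 20, 40, 5, 160, 320, 640], dtype=float)

# Growth rate per frame is the slope of log(size) against frame number.
growth_rate = theil_sen_slope(np.log(sizes))
print(growth_rate, np.log(2))
```

Because the estimate is a median, the single bad frame does not move it away from the doubling rate of log 2 per frame. That robustness is a good fit for time series where the thresholding occasionally misjudges a colony.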
Above, we have a fitness measure for each colony, so we can look not only at the number of colony forming units but also at differences in how the colonies formed. I still find it hard to make sense of this particular plot, but looking explicitly at the inter-colony heterogeneity does seem like a good exercise. Definitely better than just summarising it as a single variance. Especially since I know from experience that sometimes a variance can hide an interesting discovery.
How would you, dear reader, extend these visualizations? Or is there a good use that you can think of putting them to? After all, visualizations are one of the most important parts of science. I hope this code helps a little. At least as inspiration, or an example of how easy it is to get things done with Python.
]]>