## Idealization vs abstraction for mathematical models of evolution

This week I was in Turku, Finland for the annual congress of the European Society for Evolutionary Biology. I presented in the symposium on mathematical models in evolutionary biology organized by Guy Cooper, Matishalin Patel, Tom Scott, and Asher Leeks. It was a fun. It was also a big challenge given the short ten minute format. I decided to use my ten minutes to try to convince the audience that we should consider not just idealized models but also abstractions. So after my typical introduction of computational vs algorithmic biology, I switched to talking about triangles. If you would like, dear reader, then you can watch the whole session online (or grab my slides as pdf). In this post, I just want to focus on the distinction between idealized vs. abstract models.

Just as in my ESEB talk, I’ll use triangles to explain the distinction between idealized vs. abstract models.

## Allegory of the replication crisis in algorithmic trading

One of the most interesting ongoing problems in metascience right now is the replication crisis. This a methodological crisis around the difficulty of reproducing or replicating past studies. If we cannot repeat or recreate the results of a previous study then it casts doubt on if those ‘results’ were real or just artefacts of flawed methodology, bad statistics, or publication bias. If we view science as a collection of facts or empirical truths than this can shake the foundations of science.

The replication crisis is most often associated with psychology — a field that seems to be having the most active and self-reflective engagement with the replication crisis — but also extends to fields like general medicine (Ioannidis, 2005a,b; 2016), oncology (Begley & Ellis, 2012), marketing (Hunter, 2001), economics (Camerer et al., 2016), and even hydrology (Stagge et al., 2019).

When I last wrote about the replication crisis back in 2013, I asked what science can learn from the humanities: specifically, what we can learn from memorable characters and fanfiction. From this perspective, a lack of replication was not the disease but the symptom of the deeper malady of poor theoretical foundations. When theories, models, and experiments are individual isolated silos, there is no inherent drive to replicate because the knowledge is not directly cumulative. Instead of forcing replication, we should aim to unify theories, make them more precise and cumulative and thus create a setting where there is an inherent drive to replicate.

More importantly, in a field with well-developed theory and large deductive components, a study can advance the field even if its observed outcome turns out to be incorrect. With a cumulative theory, it is more likely that we will develop new techniques or motivate new challenges or extensions to theory independent of the details of the empirical results. In a field where theory and experiment go hand-in-hand, a single paper can advance both our empirical grounding and our theoretical techniques.

I am certainly not the only one to suggest that a lack of unifying, common, and cumulative theory as the cause for the replication crisis. But how do we act on this?

Can we just start mathematical modelling? In the case of the replicator crisis in cancer research, will mathematical oncology help?

Not necessarily. But I’ll come back to this at the end. First, a story.

Let us look at a case study: algorithmic trading in quantitative finance. This is a field that is heavy in math and light on controlled experiments. In some ways, its methodology is the opposite of the dominant methodology of psychology or cancer research. It is all about doing math and writing code to predict the markets.

Yesterday on /r/algotrading, /u/chiefkul reported on his effort to reproduce 130+ papers about “predicting the stock market”. He coded them from scratch and found that “every single paper was either p-hacked, overfit [or] subsample[d] …OR… had a smidge of Alpha [that disappears with transaction costs]”.

There’s a replication crisis for you. Even the most pessimistic readings of the literature in psychology or medicine produce significantly higher levels of successful replication. So let’s dig in a bit.

## Process over state: Math is about proofs, not theorems.

A couple of days ago, Maylin and I went to pick blackberries along some trails near our house. We spent a number of hours doing it and eventually I turned all those berries into one half-litre jar of jam.

On the way to the blackberry trails, we passed a perfectly fine Waitrose — a supermarket that sells (among countless other things) jam. A supermarket I had to go to later anyways to get jamming sugar. Why didn’t we just buy the blackberries or the jam itself? It wasn’t a matter of money: several hours of our time picking berries and cooking them cost much more than a half-litre of jam, even from Waitrose.

I think that we spent time picking the berries and making the jam for the same reason that mathematicians prove theorems.

Imagine that you had a machine where you put in a statement and it replied with perfect accuracy if that statement was true or false (or maybe ill-posed). Would mathematicians welcome such a machine? It seems that Hilbert and the other formalists at the start of the 20th century certainly did. They wanted a process that could resolve any mathematical statement.

Such a hypothetical machine would be a Waitrose for theorems.

But is math just about establishing the truth of mathematical statements? More importantly, is the math that is written for other mathematicians just about establishing the truth of mathematical statements?

I don’t think so.

Math is about ideas. About techniques for thinking and proving things. Not just about the outcome of those techniques.

This is true of much of science and philosophy, as well. So although I will focus this post on the importance of process over state/outcome in pure math, I think it can also be read from the perspective of process over state in science or philosophy more broadly.

## Generating random power-law graphs

‘Power-law’ is one of the biggest buzzwords in complexology. Almost everything is a power-law. I’ve even used it to sell my own work. But most work that deals in power-laws tends to lack rigour. And just establishing that something is a power-law shouldn’t make us feel that it is more connected to something else that is a power-law. Cosma Shalizi — the great critic of sloppy thinking in complexology — has an insightful passage on power-laws:

[T]here turn out to be nine and sixty ways of constructing power laws, and every single one of them is right, in that it does indeed produce a power law. Power laws turn out to result from a kind of central limit theorem for multiplicative growth processes, an observation which apparently dates back to Herbert Simon, and which has been rediscovered by a number of physicists (for instance, Sornette). Reed and Hughes have established an even more deflating explanation (see below). Now, just because these simple mechanisms exist, doesn’t mean they explain any particular case, but it does mean that you can’t legitimately argue “My favorite mechanism produces a power law; there is a power law here; it is very unlikely there would be a power law if my mechanism were not at work; therefore, it is reasonable to believe my mechanism is at work here.” (Deborah Mayo would say that finding a power law does not constitute a severe test of your hypothesis.) You need to do “differential diagnosis”, by identifying other, non-power-law consequences of your mechanism, which other possible explanations don’t share. This, we hardly ever do.

The curse of this multiple-realizability comes up especially when power-laws intersect with the other great field of complexology: networks.

I used to be very interested in this intersection. I was especially excited about evolutionary games on networks. But I was worried about some of the arbitrary seeming approaches in the literature to generating random power-law graphs. So before starting any projects with them, I took a look into my options. Unfortunately, I didn’t go further with the exploration.

Recently, Raoul Wadhwa has gone much more in-depth in his thinking about graphs and networks. So I thought I’d share some of my old notes on generating random power-law graphs in the hope that they might be useful to Raoul. These notes are half-baked and outdated, but maybe still fun.

Hopefully, you will find them entertaining, too, dear reader.

## Blogging community of computational and mathematical oncologists

A few weeks ago, David Basanta reached out to me (and many other members of the mathematical oncology community) about building a community blog together. This week, to coincide with the Society for Mathematical Biology meeting in Montreal, we launched the blog. In keeping with the community focus, we have an editorial board of 8 people that includes (in addition to David and me): Christina Curtis, Elana Fertig, Stacey Finley, Jakob Nikolas Kather, Jacob G. Scott, and Jeffrey West. The theme is computational and mathematical oncology, but we welcome contributions from all nearby disciplines.

The behind the scenes discussion building up to this launch was one of the motivators for my post on twitter vs blogs and science advertising versus discussion. And as you might expect, dear reader, it was important to me that this new community blog wouldn’t be just about science outreach and advertising of completed work. For me — and I think many of the editors — it is important that the blog is a place for science engagement and for developing new ideas in the open. A way to peel back the covers that hide how science is done and break the silos that inhibit a collaborative and cooperative atmosphere. A way to not only speak at the public or other scientists, but also an opportunity to listen.

For me, the blog is a challenge to the community. A challenge to engage in more flexible, interactive, and inclusive development of new ideas than is possible with traditional journals. While also allowing for a deeper, more long-form and structured discussion than is possible with twitter. If you’ve ever written a detailed research email, long discussion on Slack, or been part of an exciting journal club, lab meeting, or seminar, you know the amount of useful discussion that is foundational to science but that seldom appears in public. My hope is that we can make these discussions more public and more beneficial to the whole community.

Before pushing for the project, David made sure that he knew the lay of the land. He assembled a list of the existing blogs on computational and mathematical oncology. In our welcome post, I made sure to highlight a few of the examples of our community members developing new ideas, sharing tools and techniques, and pushing beyond outreach and advertising. But since we wanted the welcome post to be short, there was not the opportunity for a more thorough survey of our community.

In this post, I want to provide a more detailed — although never complete nor exhaustive — snapshot of the blogging community of computational and mathematical oncologists. At least the part of it that I am familiar with. If I missed you then please let me know. This is exactly what the comments on this post are for: expanding our community.

## Closing the gap between quantum and deterministic query complexity for easy to certify total functions

Recently, trying to keep with my weekly post schedule, I’ve been a bit strapped for inspiration. As such, I’ve posted a few times on a major topic from my past life: quantum query complexity. I’ve mostly tried to describe some techniques for (lower) bounding query complexity like the negative adversary method and span programs. But I’ve never really showed how to use these methods to actually set up interesting bounds.

Since I am again short of a post, I thought I’d share this week a simple proof of a bound possible with these techniques. This is based on an old note I wrote on 19 April 2011.

One of the big conjectures in quantum query complexity — at least a half decade ago when I was worrying about this topic — is that quantum queries give you at most a quadratic speedup over deterministic queries for total functions. In symbols: $D(f) = O(Q^2(f))$. Since Grover’s algorithm can give us a quadratic quantum speed-up for arbitrary total functions, this conjecture basically says: you can’t do better than Grover.

In this post, I’ll prove a baby version of this conjecture.

Let’s call a Boolean total-function easy to certify if one side of the function has a constant-length certificate complexity. I’ll prove that for easy-to-certify total functions, $D(f) = O(Q^2(f))$.

This is not an important result, but I thought it is a cute illustration of standard techniques. And so it doesn’t get lost in my old pdf, I thought I’d finally convert it to a blog post. Think of this as a simple application of the adversary method.

## The gene-interaction networks of easy fitness landscapes

Since evolutionary fitness landscapes have been a recurrent theme on TheEGG, I want to return, yet again, to the question of finding local peaks in fitness landscapes. In particular, to the distinction between easy and hard fitness landscapes.

Roughly, in easy landscapes, we can find local peaks quickly and in hard ones, we cannot. But this is very vague. To be a little more precise, I have to borrow the notion of orders of growth from the asymptotic analysis standard in computer science. A family of landscapes indexed by a size n (usually corresponding to the number of genes in the landscape) is easy if a local fitness optimum can be found in the landscapes in time polynomial in n and hard otherwise. In the case of hard landscapes, we can’t guarantee to find a local fitness peak and thus can sometimes reason from a state of perpetual maladaptive disequilibrium.

In Kaznatcheev (2019), I introduced this distinction to biology. Since hard landscapes have more interesting properties which are more challenging to theoretical biologist’s intuitions, I focused more on this. This was read — perhaps rightly — as me advocating for the existence or ubiquity of hard landscapes. And that if hard landscapes don’t occur in nature then my distinction is pointless. But I don’t think this is the most useful reading.

It certainly would be fun if hard landscapes were a feature of nature since they give us a new way to approach certain puzzles like the maintenance of cooperation, the evolution of costly learning, or open-ended evolution. But this is an empirical question. What isn’t a question is that hard landscape are a feature of our mental and mathematical models of evolution. As such, all — or most, whatever that means — fitness landscapes being easy is still exciting for me. It means that the easy vs hard distinction can push us to refine our mental models such that if only easy landscapes occur in nature then our models should only be able to express easy landscapes.

In other words, using computational complexity to build upper-bounds arguments (that on certain classes of landscapes, local optima can be found efficiently) can be just as fun as lower-bounds arguments (that on certain classes of landscapes, evolution requires at least a super-polynomial effort to find any local fitness peak). However, apart from a brief mention of smooth landscapes, I did not stress the upper-bounds in Kaznatcheev (2019).

Now, together with David Cohen and Peter Jeavons, I’ve taken this next step — at least in the cstheory context, we still need to write on the biology. So in this post, I want to talk briefly about a biological framing of Kaznatcheev, Cohen & Jeavons (2019) and the kind of fitness landscapes that are easy for evolution.

## Twitter vs blogs and science advertising vs discussion

I read and write a lot of science outside the traditional medium of papers. Most often on blogs, twitter, and Reddit. And these alternative media are colliding more and more with the ‘mainstream media’ of academic publishing. A particularly visible trend has been the twitter paper thread: a collection of tweets that advertise a new paper and summarize its results. I’ve even written such a thread (5-6 March) for my recent paper on how to use cstheory to think about evolution.

Recently, David Basanta stumbled across an old (19 March) twitter thread by Dan Quintana for why people should use such twitter threads, instead of blog posts, to announce their papers. Given my passion for blogging, I think that David expected me to defend blogs against this assault. But instead of siding with David, I sided with Dan Quintana.

If you are going to be ‘announcing’ a paper via a thread then I think you should use a twitter thread, not a blog. At least, that is what I will try to stick to on TheEGG.

Yesterday, David wrote a blog post to elaborate on his position. So I thought that I would follow suit and write one to elaborate mine. Unlike David’s blog, TheEGG has comments — so I encourage you, dear reader, to use those to disagree with me.

## Description before prediction: evolutionary games in oncology

As I discussed towards the end of an old post on cross-validation and prediction: we don’t always want to have prediction as our primary goal, or metric of success. In fact, I think that if a discipline has not found a vocabulary for its basic terms, a grammar for combining those terms, and a framework for collecting, interpreting, and/or translating experimental practice into those terms then focusing on prediction can actually slow us down or push us in the wrong direction. To adapt Knuth: I suspect that premature optimization of predictive potential is the root of all evil.

We need to first have a good framework for describing and summarizing phenomena before we set out to build theories within that framework for predicting phenomena.

In this brief post, I want to ask if evolutionary games in oncology are ready for building predictive models. Or if they are still in need of establishing themselves as a good descriptive framework.

## Fighting about frequency and randomly generating fitness landscapes

A couple of months ago, I was in Cambridge for the Evolution Evolving conference. It was a lot of fun, and it was nice to catch up with some familiar faces and meet some new ones. My favourite talk was Karen Kovaka‘s “Fighting about frequency”. It was an extremely well-delivered talk on the philosophy of science. And it engaged with a topic that has been very important to discussions of my own recent work. Although in my case it is on a much smaller scale than the general phenomenon that Kovaka was concerned with,

Let me first set up my own teacup, before discussing the more general storm.

Recently, I’ve had a number of chances to present my work on computational complexity as an ultimate constraint on evolution. And some questions have repeated again and again after several of the presentations. I want to address one of these persistent questions in this post.

How common are hard fitness landscapes?

This question has come up during review, presentations, and emails (most recently from Jianzhi Zhang’s reading group). I’ve spent some time addressing it in the paper. But it is not a question with a clear answer. So unsurprisingly, my comments have not been clear. Hence, I want to use this post to add some clarity.