Allegory of the replication crisis in algorithmic trading

One of the most interesting ongoing problems in metascience right now is the replication crisis. This is a methodological crisis around the difficulty of reproducing or replicating past studies. If we cannot repeat or recreate the results of a previous study, then it casts doubt on whether those ‘results’ were real or just artefacts of flawed methodology, bad statistics, or publication bias. If we view science as a collection of facts or empirical truths, then this can shake the foundations of science.

The replication crisis is most often associated with psychology — a field that seems to be having the most active and self-reflective engagement with the replication crisis — but also extends to fields like general medicine (Ioannidis, 2005a,b; 2016), oncology (Begley & Ellis, 2012), marketing (Hunter, 2001), economics (Camerer et al., 2016), and even hydrology (Stagge et al., 2019).

When I last wrote about the replication crisis back in 2013, I asked what science can learn from the humanities: specifically, what we can learn from memorable characters and fanfiction. From this perspective, a lack of replication was not the disease but the symptom of the deeper malady of poor theoretical foundations. When theories, models, and experiments are individual isolated silos, there is no inherent drive to replicate because the knowledge is not directly cumulative. Instead of forcing replication, we should aim to unify theories, make them more precise and cumulative and thus create a setting where there is an inherent drive to replicate.

More importantly, in a field with well-developed theory and large deductive components, a study can advance the field even if its observed outcome turns out to be incorrect. With a cumulative theory, it is more likely that we will develop new techniques or motivate new challenges or extensions to theory independent of the details of the empirical results. In a field where theory and experiment go hand-in-hand, a single paper can advance both our empirical grounding and our theoretical techniques.

I am certainly not the only one to suggest a lack of unifying, common, and cumulative theory as the cause of the replication crisis. But how do we act on this?

Can we just start mathematical modelling? In the case of the replication crisis in cancer research, will mathematical oncology help?

Not necessarily. But I’ll come back to this at the end. First, a story.

Let us look at a case study: algorithmic trading in quantitative finance. This is a field that is heavy in math and light on controlled experiments. In some ways, its methodology is the opposite of the dominant methodology of psychology or cancer research. It is all about doing math and writing code to predict the markets.

Yesterday on /r/algotrading, /u/chiefkul reported on their effort to reproduce 130+ papers about “predicting the stock market”. They coded them from scratch and found that “every single paper was either p-hacked, overfit [or] subsample[d] …OR… had a smidge of Alpha [that disappears with transaction costs]”.

There’s a replication crisis for you. Even the most pessimistic readings of the literature in psychology or medicine produce significantly higher levels of successful replication. So let’s dig in a bit.
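To make the “smidge of Alpha [that disappears with transaction costs]” point concrete, here is a minimal sketch. All numbers are assumed for illustration and are not taken from any of the papers: a strategy with a tiny gross edge that trades daily can be profitable before costs and unprofitable after them.

```python
# A hypothetical strategy with a small gross edge that trades once per day.
# All numbers are assumed for illustration, not taken from any paper.
daily_gross_alpha = 1e-4   # 1 basis point of gross edge per day (assumed)
round_trip_cost = 2e-4     # 2 basis points of transaction costs per day (assumed)
trading_days = 252

gross_annual = daily_gross_alpha * trading_days
net_annual = (daily_gross_alpha - round_trip_cost) * trading_days

print(f"gross annualized alpha: {gross_annual:+.2%}")  # +2.52%
print(f"net annualized alpha:   {net_annual:+.2%}")    # -2.52%
```

The arithmetic is trivial, but that is the point: an edge smaller than its own trading costs is not an edge at all, no matter how significant the backtest looks before costs.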

How would finance make sense of this failure to replicate?

The first defence in finance is an ontological one: the market cares about research on the market. Specifically, if you find a winning strategy then others will copy you and the strategy will stop winning. In quantitative finance, this is given a special name: alpha-decay.

/u/chiefkul reports that “[e]very author that’s been publicly challenged about the results of their paper says it’s stopped working due to “Alpha decay” because they made their methodology public.”

In other words, the authors claim that the disappearance of their results is not regression to the mean due to p-hacking for a big alpha. Instead, it is an actual change in the market due to others using the strategy or feature they discovered and thus making their strategy obsolete.

This sort of defence is occasionally used in psychology as well. Some past result is so widely reported that the deception (or some other feature) that underlies the experiment is no longer possible and thus a real result disappeared.

On the surface, regression to the mean and alpha decay can seem difficult to tell apart. But regression to the mean has an important time symmetry that alpha decay doesn’t: we expect to see regression to the mean if we extend our dataset either forward or backward in time. Alpha decay, on the other hand, should only appear after publication. So if we extend our historic data back in time, to before the period the paper trained on, we should still expect to see big alpha.
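This time-symmetry test can be sketched in a small simulation (all parameters assumed for illustration): select the best of many pure-noise strategies on an in-sample window, and compare it to a strategy with a genuine but exaggerated edge that decays only after ‘publication’. Extending the data backward distinguishes the two.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 252  # one year of daily returns per window

# Case 1: p-hacked "alpha". Pick the best of 200 pure-noise strategies on
# the in-sample window, then check it on the windows before and after.
noise = rng.normal(0.0, 0.01, size=(200, 3 * n))   # [past | in-sample | future]
best = np.argmax(noise[:, n:2 * n].mean(axis=1))   # selected on in-sample only
hacked_past = noise[best, :n].mean()
hacked_future = noise[best, 2 * n:].mean()

# Case 2: genuine alpha that decays after publication. The (exaggerated,
# assumed) edge is present in the past and in-sample windows, and vanishes
# only in the post-publication window.
edge = 5e-3
real_past = rng.normal(edge, 0.01, n).mean()
real_future = rng.normal(0.0, 0.01, n).mean()

# Time-symmetry test: only the genuine-then-decayed strategy keeps its
# alpha when we extend the data backward in time.
print(f"p-hacked: past {hacked_past:+.5f}, future {hacked_future:+.5f}")
print(f"decayed:  past {real_past:+.5f}, future {real_future:+.5f}")
```

The p-hacked strategy’s ‘alpha’ is an artefact of selection on the in-sample window, so it is absent both before and after; the decayed strategy keeps its edge in the past.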

This is exactly what /u/chiefkul reports doing. They report: “For the papers that I could reproduce, all of them failed regardless of whether you go back or forwards [in time]”.

So the ontological defence seems suspect.

The second defence in finance is a sociological one: only bad strategies are published; if you had a good strategy then you would simply use it to make money instead of sharing it with the public. Or alternatively: academia consistently underperforms the ‘real world’. Any actually good strategy doesn’t become a paper; it becomes a hedge fund.

This is a convenient defence. The finance equivalent of security through obscurity.

This sociological defence certainly sounds plausible until we look at the real-world performance of actively managed funds. Looking back from 2019 at the performance of large-cap funds versus the S&P 500: over the prior year, nearly 2/3rds of actively managed funds underperformed the S&P 500; over the prior 10 years, more than 8 in 10 underperformed; and over the prior 15 years, around 9 in 10 actively managed large-cap funds underperformed the S&P 500.

So even if the best strategies are being turned into hedge funds, it certainly doesn’t seem that they provide consistent returns. This makes the sociological defence seem suspect.

At this point, we might start to reflect on finance’s strained relationship with cross-validation. Or re-read Bailey et al. (2014) Pseudo-Mathematics and Financial Charlatanism. It is tempting to start nodding along with /u/chiefkul as they confirm our priors about the failures of finance.
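The core point of Bailey et al. can be sketched in a few lines (parameters assumed for illustration): if you try enough strategy variants on the same data, the best backtest looks impressive even when every variant is pure noise.

```python
import numpy as np

rng = np.random.default_rng(2)

def best_insample_sharpe(n_strategies: int, n_days: int = 252) -> float:
    """Best annualized Sharpe ratio among pure-noise strategies.

    Every strategy is i.i.d. noise with zero true edge, so any
    'skill' found here is pure backtest overfitting.
    """
    rets = rng.normal(0.0, 0.01, size=(n_strategies, n_days))
    sharpes = rets.mean(axis=1) / rets.std(axis=1) * np.sqrt(252)
    return float(sharpes.max())

# The more strategy variants we try on the same data, the more
# impressive the best backtest looks -- with zero real alpha.
for n in (1, 10, 100, 1000):
    print(f"best of {n:>4} noise strategies: Sharpe {best_insample_sharpe(n):.2f}")
```

The expected maximum Sharpe ratio grows with the number of variants tried, which is why a backtest reported without the number of trials behind it tells us very little.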

But should we trust /u/chiefkul?

This is an important question in science as well: what does a failed replication tell us? How do we know that the replication was done well? With the excitement for failed replications, how can we be confident that the replicators themselves don’t suffer from sampling bias or p-hacking?

A lot of this is addressed by doing careful open science and by pre-registration. By sharing our datasets and methods.

In the case of the claimed failed replication of 130+ papers, we have none of this. Other redditors, like /u/programmerChilli on /r/slatestarcodex or /u/spogett on /r/algotrading, are skeptical of this replication effort. Given that most papers are poorly written, most methods are underdescribed, and datasets are hard to get, it is a challenge for most people to reproduce even a single paper. Claiming to reproduce 130+ in 7 months is a stretch. Especially since, when asked for their code, /u/chiefkul responds with comments like:

Honestly none of it is particularly good and it’s just a complete mess now. As I was trying to explain I kinda lumped papers together so I’d build a script for TA+ML for example and then I’d just keep editing it for every paper that was in that category.

As /u/spogett writes on the original post: “this post does not pass the smell test”.

But it is certainly attention-grabbing, both on Reddit and on Twitter.

This should remind us of a final point to keep in mind when thinking about the replication crisis. If we are sceptical of how first-order studies might or might not replicate, we should also be sceptical of second-order replication studies. If we want to hold first-order studies to higher standards then we should hold the second-order to even higher standards. To lead by example.

I am now left in an awkward position. If /u/chiefkul’s post is a publicity stunt or just trolling, then how can we learn anything from it? What was the point of reading /u/chiefkul’s post or, worse yet, my post about their post? If, in the end, we can be no more or less confident that there is or isn’t a replication crisis in algorithmic trading, then what can we possibly learn for how we approach the replication crisis in science?

Did you just waste 10 minutes of your time by reading this post? Did I waste several hours by writing it?

I hope not.

As I tried to argue last week in the context of mathematics: it isn’t just the outcome state (theorem) that matters but the process (ideas and proof technique).

Even if /u/chiefkul’s report is a publicity stunt, by analyzing it and thinking about it, we can learn things about the replication crisis. What we learn in the process is valuable even if the finding of ‘almost all results in algorithmic trading don’t replicate’ turns out to be false. Of course, in the case of a publicity stunt, we learn less than we would have from genuine results. But either way, we don’t completely waste our time and efforts.

So what does this mean for introducing more mathematical modelling to fields like psychology or cancer research?

We need to be wary of mathematical models that only take accepted or expected empirical foundations and turn them into surprising results. Instead, we need to do work that can be useful even if its empirical grounding is shaky. We need to write papers that combine a surprising conclusion based on the facts we currently believe with extensions of methodology or new techniques that can be useful even if those ground ‘facts’ turn out to be false.

We need to not only use mathematics and statistics to transform historical data into new predictions, but also to develop new mathematics and statistics that are still worth studying if that historical data turns out to be bad or the predictions aren’t realized. And we need to share these techniques in an open and accessible way.


Bailey, D., Borwein, J., de Prado, M.L., & Zhu, Q. (2014). Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance. Notices of the American Mathematical Society, 61(5).

Begley, C. G., & Ellis, L. M. (2012). Drug development: Raise standards for preclinical cancer research. Nature, 483(7391): 531.

Camerer, C. F., Dreber, A., Forsell, E., Ho, T. H., Huber, J., Johannesson, M., … & Heikensten, E. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280): 1433-1436.

Hunter, J. E. (2001). The desperate need for replications. Journal of Consumer Research, 28(1): 149-158.

Ioannidis, J.P. (2005a). Contradicted and initially stronger effects in highly cited clinical research. JAMA, 294(2): 218-228.

Ioannidis, J. P. (2005b). Why most published research findings are false. PLoS Medicine, 2(8): e124.

Ioannidis, J. P. (2016). Why most clinical research is not useful. PLoS Medicine, 13(6): e1002049.

Stagge, J. H., Rosenberg, D. E., Abdallah, A. M., Akbar, H., Attallah, N. A., & James, R. (2019). Assessing data availability and research reproducibility in hydrology and water resources. Scientific Data, 6: 190030.


About Artem Kaznatcheev
From the Department of Computer Science at Oxford University and Department of Translational Hematology & Oncology Research at Cleveland Clinic, I marvel at the world through algorithmic lenses. My mind is drawn to evolutionary dynamics, theoretical computer science, mathematical oncology, computational learning theory, and philosophy of science. Previously I was at the Department of Integrated Mathematical Oncology at Moffitt Cancer Center, and the School of Computer Science and Department of Psychology at McGill University. In a past life, I worried about quantum queries at the Institute for Quantum Computing and Department of Combinatorics & Optimization at University of Waterloo and as a visitor to the Centre for Quantum Technologies at National University of Singapore. Meander with me on Google+ and Twitter.

One Response to Allegory of the replication crisis in algorithmic trading

  1. 17pretzels says:

    Hi Artem — I think you would be interested in the Replication Markets project, a prediction market in which forecasters bid on the outcomes of replications of 3000 claims in the social and behavioral sciences. The website is .
