Cross-validation in finance, psychology, and political science

A large chunk of machine learning (although not all of it) is concerned with predictive modeling, usually in the form of designing an algorithm that takes in some data set and returns an algorithm (or sometimes, a description of an algorithm) for making predictions based on future data. In terminology more friendly to the philosophy of science, we may say that we are defining a rule of induction that will tell us how to turn past observations into a hypothesis for making future predictions. Of course, Hume tells us that if we are completely skeptical then there is no justification for induction; in machine learning we usually know this as a no-free-lunch theorem. However, we still use induction all the time, usually with some confidence, because we assume that the world has regularities that we can extract. Unfortunately, this just shifts the problem, since there are countless possible regularities and we have to identify ‘the right one’.

Thankfully, this restatement of the problem is more approachable if we assume that our data set did not conspire against us. That being said, every data set, no matter how ‘typical’, has some idiosyncrasies, and if we tune in to these instead of the ‘true’ regularity then we say we are over-fitting. Being aware of and circumventing over-fitting is usually one of the first lessons of an introductory machine learning course. The general technique we learn is cross-validation or out-of-sample validation. One round of cross-validation consists of randomly partitioning our data into a training set and a validation set, then running our induction algorithm on the training set to generate a hypothesis algorithm, which we test on the validation set. A ‘good’ machine learning algorithm (or rule for induction) is one where the performance in-sample (on the training set) is about the same as out-of-sample (on the validation set), and both performances are better than chance. The technique is so foundational that the only reliable way to earn zero on a machine learning assignment is by not doing cross-validation of your predictive models. The technique is so ubiquitous in machine learning and statistics that the StackExchange dedicated to statistics is named Cross Validated. The technique is so…

You get the point.
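
For readers who want the mechanics spelled out, one round of the procedure described above can be sketched roughly as follows. This is only an illustration: the toy data (a noisy line through the origin), the least-squares `fit_slope` induction rule, the 70/30 split, and all of the function names are assumptions made for the example, not anything prescribed by the technique itself.

```python
import random


def fit_slope(points):
    """Toy induction rule: least-squares slope of a line through the origin."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / sxx


def mean_squared_error(slope, points):
    """Average squared prediction error of the hypothesis y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in points) / len(points)


def one_round(points, rng, train_fraction=0.7):
    """Randomly partition the data, fit on one part, score on both parts."""
    shuffled = list(points)
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    train, validate = shuffled[:cut], shuffled[cut:]
    slope = fit_slope(train)
    return mean_squared_error(slope, train), mean_squared_error(slope, validate)


def make_toy_data(rng, n=200):
    """Hypothetical data set: y = 2x plus unit Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(1.0, 10.0)
        data.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))
    return data


rng = random.Random(0)
in_sample, out_of_sample = one_round(make_toy_data(rng), rng)
print(f"in-sample MSE: {in_sample:.3f}, out-of-sample MSE: {out_of_sample:.3f}")
```

A rule of induction passes this check when the two errors are close to each other and both are clearly better than a chance baseline; repeating the round over several random partitions gives a less noisy picture.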

If you are a regular reader, you can probably induce from past posts that my point is not to write an introductory lecture on cross-validation. Instead, I wanted to highlight some cases in science and society when cross-validation isn’t used, when it needn’t be used, and maybe even when it shouldn’t be used.

A good first stop for looking at prediction is finance. Markets provide us with the perfect opportunity to look at a very complicated system that we understand poorly, but where we can relatively easily quantify the success of predictions as the amount of money made trading, and where we have a natural data source in the form of price data. It is also relatively natural to look through this historic data to try to find patterns we can exploit to beat the market and consistently make money by shorting the stocks our model predicts will depreciate and going long on the ones it predicts will rise in value.

A simple technique we can use is to look at properties of recent prices and some other easily accessible indicators (time of day, week, month, etc.) and come up with a set of possible trading strategies. We can then look at the historic data and figure out which trading strategy is best. Unfortunately, the potential patterns we are looking for have very small effects, and the number of potential strategies we consider is usually very large. In this case, it is pretty clear-cut that we should use cross-validation to calibrate our model. Yet many investors don’t do this, and instead simply pick the strategy that is best according to back-testing without any out-of-sample validation (Bailey et al., 2014). The result is strategies that look great on historic data, but perform at chance levels on new data.
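
A small simulation can make this selection effect concrete. The sketch below is not the analysis of Bailey et al. (2014); it is a toy setup under loud assumptions: daily returns are pure zero-mean Gaussian noise (so no strategy has a real edge), and a hypothetical ‘strategy’ is a lookup table that goes long or short depending on the up/down pattern of the previous three days. All names and parameters are made up for the illustration.

```python
import random
from itertools import product


def sign_codes(returns, k):
    """Encode the up/down pattern of the k returns before each day as an integer."""
    codes = []
    for t in range(k, len(returns)):
        code = 0
        for r in returns[t - k:t]:
            code = 2 * code + (1 if r > 0 else 0)
        codes.append(code)
    return codes


def average_profit(table, codes, returns, k):
    """Mean daily profit when table[code] is +1 (long) or -1 (short)."""
    total = sum(table[c] * r for c, r in zip(codes, returns[k:]))
    return total / len(codes)


def best_backtest(seed, n_days=500, k=3):
    """Pick the best of all 2**(2**k) lookup-table strategies on pure noise."""
    rng = random.Random(seed)
    # Assumption: returns are zero-mean noise, so no strategy has a real edge.
    history = [rng.gauss(0.0, 1.0) for _ in range(n_days)]
    future = [rng.gauss(0.0, 1.0) for _ in range(n_days)]
    hist_codes, fut_codes = sign_codes(history, k), sign_codes(future, k)
    strategies = product((-1, 1), repeat=2 ** k)
    best = max(strategies, key=lambda s: average_profit(s, hist_codes, history, k))
    return (average_profit(best, hist_codes, history, k),
            average_profit(best, fut_codes, future, k))


backtest, live = best_backtest(seed=0)
print(f"best strategy in back-test: {backtest:+.3f} per day; on new data: {live:+.3f} per day")
```

Because every strategy and its mirror image are both among the candidates, the winner always shows a positive back-tested profit, even though by construction there is nothing to find; on fresh noise its profit is zero in expectation.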

This is bad news if you trusted a financial charlatan, but to some extent it should not be much of a surprise. What really concerned me is that Bailey et al. (2014) suggest that most academics working on financial prediction also don’t perform any out-of-sample tests on the supposed ‘patterns’ they see in the market. Of course, without out-of-sample testing, and with a lot of potential strategies considered, the results cannot be expected to capture regularities of the market; they are instead just idiosyncrasies of the particular data set. This means that if you actually implement one of these published trading strategies then you are likely to get chance results. No wonder belief in the efficient markets hypothesis is so prevalent: bad statistics just acts as a constant source of confirmation bias.

Unfortunately, there does not appear to be a straightforward path to overcoming this lack of out-of-sample testing. The roadblock is a lack of trust: the publishing culture incentivizes strong, clear effects, and that makes other researchers distrust cross-validation because it relies on researchers selecting their hypothesis classes and training-to-validation partitions honestly. This can be partially overcome by making the statistical code and data sets open, but that still doesn’t prevent tweaking the researcher degrees of freedom in selecting hypothesis classes to game the validation (kind of how you can use grue and bleen to game induction). A cleaner way to circumvent this is with a culture of conference or working papers: by publishing your preliminary model in a conference or archived working paper, you can use future data over the coming months as a source of out-of-sample validation. For example, Leinweber and Sisk (2011) pursued this strategy by presenting their investment strategy at a conference and announcing that six months later they would publish their results with as-of-then unobserved, pure out-of-sample data.

The fact that the same people published the original model and then also ‘reproduced’ its results on out-of-sample data is in some way incidental. You could also have had two separate teams (which can be even more advantageous when the raw data is highly interpretable or heavily theory-laden), and this brings us to the more general phenomenon of reproducibility, a lack of which plagues more fields than just finance. I’ve highlighted an analogous issue in psychology before: the so-called replicability crisis.

The replicability-based, community-wide approach is for a team to develop a hypothesis using any techniques available, and leave the out-of-sample testing to future researchers. Sometimes a few extra constraints are added through study registration and the explicit declaration of what statistical methods will be used. These methods are a great way to validate the results of any given study, but that level of analysis misses part of the point. The fundamental issue with the finance examples was testing too many hypotheses; to create a similar situation, we need to look at the level of the whole community.


Consider a drug trial: suppose that testing one drug on 1000 patients is enough to establish whether it is effective at a p < 0.01 significance level. For any single group, this is sufficiently stringent, but a big drug company could easily fund hundreds of studies and find ‘results’ in noise. Thus, individual validation is not always sufficient; some sort of mechanism has to exist at the level of the whole research community.
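
The arithmetic behind this worry is short enough to write down. The sketch below assumes, purely for illustration, that none of the drugs work and that the studies are statistically independent; the function names are made up for the example.

```python
def chance_of_spurious_result(n_studies, alpha):
    """P(at least one study passes the threshold) when no drug works,
    assuming the studies are independent."""
    return 1.0 - (1.0 - alpha) ** n_studies


def expected_spurious_results(n_studies, alpha):
    """Expected number of studies that pass the threshold by chance alone."""
    return n_studies * alpha


for n in (1, 10, 100, 500):
    p = chance_of_spurious_result(n, alpha=0.01)
    print(f"{n:4d} studies: P(at least one false positive) = {p:.3f}")
```

Under these assumptions, 100 studies of ineffective drugs at p < 0.01 give roughly a 63% chance of at least one spurious ‘result’, with one false positive expected on average.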

I am not sure what structural features need to be changed to avoid this. Publishing negative results is a good start but will not go that far toward solving the problem. I suspect that the fundamental issue is that these sorts of statistical sciences, like psychology and biomedicine, are just institutional equivalents of blind data. Without working towards a theoretical grounding and understanding, it is not possible to run experiments to test edge cases; in machine learning, it might be equivalent to only having samples from the distribution of inputs instead of also being able to ask for specific ones, which we know is a much weaker model of learning (Angluin, 1987). Of course, to even get this off the ground, there needs to be a culture of progressive knowledge: results that build on top of previous results, instead of just reusing the methods of previous groups to get one-time results that no other group wants to look at unless some external incentive is placed on them to reproduce the previous study (although that is also not progressive, just repetitive).

There is a further possibility: Popper (1959) argues that the ideal of prediction is unreasonable in most settings. In the context of political science, he writes:

[L]ong term prophecies can be derived from scientific conditional predictions only if they apply to systems which can be described as well isolated, stationary, and recurrent. These systems are very rare in nature; and modern society is surely not one of them.

The above view is popular among many social scientists, and some variants of the efficient market hypothesis can be seen as finance’s version of it. In both cases there are counterexamples showing that some effects are recurrent. For political science and history, Turchin’s (2008) cliodynamics shows, for example, recurrence in the dynamics of inequality and well-being. For such situations, Jay Ulfelder makes a good case for introducing more out-of-sample testing into social modeling. In finance, we can turn to robust historical results on trend-following strategies as evidence against efficient markets (Szakmary et al., 2010; Lempérière et al., 2014). Of course, in both cases we should still keep an eye out for overfitting.

However, such examples don’t make me disagree with Popper, at least not completely. If Popper were to replace ‘systems’ by ‘theoretical terms used to describe systems’ then I would strongly agree with him. I think the fundamental feature of ‘science success stories’ is not (just) prediction but the selection of our terms, language, and what we aim to predict. There is no reason to believe that the folk or common-sense variables are necessarily important. Our folk theory of mind is tuned to predicting the behavior of a few individuals of our tribe on a short time horizon, so why would we expect the theoretical constructs of this system to be good for describing or predicting the long-term dynamics of large-scale societies? Look at physics, for example: its theoretical terms don’t resemble our common-sense notions at all; sometimes we can even prove that our common-sense terms are inapplicable.

At times when a field of study is searching for its language of theoretical terms, I think that focusing on prediction can actually be counter-productive and should be avoided. Imagine if physicists were concerned with predicting the shape or the color or the saltiness of an individual electron. From a folk theory perspective these seem like potentially reasonable questions to ask, but in reality they are pseudo-questions that would just lead physics astray into silly theories.

I believe that political science, and to some extent psychology, is not yet at the stage where it can distinguish pseudo-questions from “meaningful” questions. I think it is still searching for the terms and relevant variables, a way to ask questions that are guaranteed or likely to have answers. When a science is in this early stage, it is better to try to avoid accumulation and explore freely until we accidentally stumble on a useful language.

I guess I am advocating, as Feyerabend does, for a plurality of approaches with no restrictions on method and no demands for accumulation or for prediction with cross-validation. At least not until we have reasonable evidence to believe that prediction will lead us somewhere that we want to go.


Angluin, D. (1987). Learning regular sets from queries and counterexamples. Information and Computation, 75, 87-106.

Bailey, D., Borwein, J., de Prado, M.L., & Zhu, Q. (2014). Pseudo-Mathematics and Financial Charlatanism: The Effects of Backtest Overfitting on Out-of-Sample Performance. Notices of the American Mathematical Society, 61(5). DOI: 10.1090/noti1105

Leinweber, D., & Sisk, J. (2011). Event-Driven Trading and the “New News”. The Journal of Portfolio Management, 38(1), 110-124.

Lempérière, Y., Deremble, C., Seager, P., Potters, M., & Bouchaud, J. P. (2014). Two centuries of trend following. arXiv preprint arXiv:1404.3274.

Popper, K.R. (1959). Prediction and prophecy in the social sciences. Theories of History, 276-285.

Szakmary, A. C., Shen, Q., & Sharma, S. C. (2010). Trend-following trading strategies in commodity futures: A re-examination. Journal of Banking & Finance, 34(2), 409-426.

About Artem Kaznatcheev
From the Department of Computer Science at Oxford University and Department of Translational Hematology & Oncology Research at Cleveland Clinic, I marvel at the world through algorithmic lenses. My mind is drawn to evolutionary dynamics, theoretical computer science, mathematical oncology, computational learning theory, and philosophy of science. Previously I was at the Department of Integrated Mathematical Oncology at Moffitt Cancer Center, and the School of Computer Science and Department of Psychology at McGill University. In a past life, I worried about quantum queries at the Institute for Quantum Computing and Department of Combinatorics & Optimization at University of Waterloo and as a visitor to the Centre for Quantum Technologies at National University of Singapore. Meander with me on Google+ and Twitter.

8 Responses to Cross-validation in finance, psychology, and political science

  1. Nice post… I would add that depending on what you are doing with the data, even “in-sample” cross-validation may be important. The case in point: if you assume your hypothesis class to be made of ergodic stationary processes, and learn a generative model from some given data, and the data deviates significantly from ergodicity and/or stationarity, then one should be able to detect this with in-sample validation itself. Even if the basic assumptions remain true, one should be able to use the inferred model to generate new data (generative model), and test if this generation matches the observations in some well-defined stochastic sense. Unfortunately, most parametric models are “force-fitted” to the data at hand, making such in-sample validation impossible. The reason why this is interesting is that it is simpler, when the framework allows for it. Out-of-sample validation may fail simply because the process itself changed… think of finding the magically perfect financial predictor on data up to 2006, and then trying to cross-validate when the economy started tanking in 2008.
