Cross-validation in finance, psychology, and political science

A large chunk of machine learning (although not all of it) is concerned with predictive modeling, usually in the form of designing an algorithm that takes in some data set and returns an algorithm (or sometimes, a description of an algorithm) for making predictions based on future data. In terminology more friendly to the philosophy of science, we may say that we are defining a rule of induction that will tell us how to turn past observations into a hypothesis for making future predictions. Of course, Hume tells us that if we are completely skeptical then there is no justification for induction; in machine learning we usually know this as a no-free-lunch theorem. However, we still use induction all the time, usually with some confidence because we assume that the world has regularities that we can extract. Unfortunately, this just shifts the problem since there are countless possible regularities and we have to identify ‘the right one’.

Thankfully, this restatement of the problem is more approachable if we assume that our data set did not conspire against us. That being said, every data set, no matter how ‘typical’, has some idiosyncrasies, and if we tune in to these instead of the ‘true’ regularity then we say we are over-fitting. Being aware of and circumventing over-fitting is usually one of the first lessons of an introductory machine learning course. The general technique we learn is cross-validation or out-of-sample validation. One round of cross-validation consists of randomly partitioning our data into a training set and a validating set, then running our induction algorithm on the training set to generate a hypothesis algorithm, which we test on the validating set. A ‘good’ machine learning algorithm (or rule for induction) is one where the performance in-sample (on the training set) is about the same as out-of-sample (on the validating set), and both performances are better than chance. The technique is so foundational that the only reliable way to earn zero on a machine learning assignment is by not doing cross-validation of your predictive models. The technique is so ubiquitous in machine learning and statistics that the StackExchange dedicated to statistics is named CrossValidated. The technique is so…

You get the point.

If you are a regular reader, you can probably induce from past posts that my point is not to write an introductory lecture on cross-validation. Instead, I wanted to highlight some cases in science and society when cross-validation isn’t used, when it needn’t be used, and maybe even when it shouldn’t be used.
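For readers who want something concrete, one round of cross-validation as described above can be sketched in a few lines. This is only a minimal illustration: the synthetic data, the least-squares line used as the rule of induction, the 70/30 split, and the names `fit_line`, `mse`, and `one_round` are all assumptions made for the example.

```python
import random


def fit_line(points):
    """Rule of induction (assumed for illustration): least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The returned function is the hypothesis used for future predictions.
    return lambda x: slope * x + intercept


def mse(hypothesis, points):
    """Mean squared prediction error of a hypothesis on a set of points."""
    return sum((hypothesis(x) - y) ** 2 for x, y in points) / len(points)


def one_round(data, induce, train_fraction=0.7, seed=0):
    """Randomly partition the data, induce on one part, test on the other."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    train, valid = shuffled[:cut], shuffled[cut:]
    hypothesis = induce(train)
    return mse(hypothesis, train), mse(hypothesis, valid)


# Synthetic data: a noisy line (an assumption, chosen to have a real regularity).
noise = random.Random(1)
data = [(i / 10, 2.0 * (i / 10) + 1.0 + noise.gauss(0, 0.5)) for i in range(100)]

in_sample, out_of_sample = one_round(data, fit_line)
print(f"in-sample MSE: {in_sample:.3f}, out-of-sample MSE: {out_of_sample:.3f}")
```

If the two errors are close and both are well below what a constant guess would achieve, the rule of induction is doing what we hoped on this data.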

Big data, prediction, and scientism in the social sciences

Much of my undergrad was spent studying physics, and although I still think that a physics background is great for a theorist in any field, there are some downsides. For example, I used to make jokes like: “soft isn’t the opposite of hard sciences, easy is.” Thankfully, over the years I have started to slowly grow out of these condescending views. Of course, apart from amusing anecdotes, my past bigotry would be of little importance if it wasn’t shared by a surprising number of grown physicists. For example, Sabine Hossenfelder, an assistant professor of physics in Frankfurt, writes in a recent post:

If you need some help with the math, let me know, but that should be enough to get you started! Huh? No, I don't need to read your thesis, I can imagine roughly what it says.

It isn’t so surprising that social scientists themselves are unhappy because the boat of inadequate skills is sinking in the data sea and physics envy won’t keep it afloat. More interesting than the paddling social scientists is the public opposition to the idea that the behavior of social systems can be modeled, understood, and predicted.

As a blogger I understand that we can sometimes be overly bold and confrontational. Since blogging is an informal medium, I have no fundamental problem with such strong statements or even straw-men if they are part of a productive discussion or critique. If there is no useful discussion, I would normally just make a small comment or ignore the post completely, but this time I decided to focus on Hossenfelder’s post because it highlights a common symptom of interdisciplinitis: an outsider thinking that they are addressing people’s critique (usually by restating an obvious and irrelevant argument) while completely missing the point. Also, her comments serve as a nice bow to tie together some thoughts that I’ve been wanting to write about recently.

Liquidity hoarding and systemic failure in the ecology of banks

As you might have guessed from my recent posts, I am cautious in trying to use mathematics to build insilications for predicting, profiting from, or controlling financial markets. However, I realize the wealth of data available on financial networks and interactions (compared to similar resources in ecology, for example) and the myriad of interesting questions about both economics and humans (and their institutions) more generally that understanding finance can answer. As such, I am more than happy to look at heuristics and other toy models in order to learn about financial systems. I am particularly interested in understanding the interplay between individual versus systemic risk because of analogies to social dilemmas in evolutionary game theory (and the related discussions of individual vs. inclusive vs. group fitness) and recently developed connections with modeling in ecology.

Three-month Libor-overnight Interest Swap based on data from Bloomberg and figure 1 of Domanski & Turner (2011). The vertical line marks 15 September 2008 -- the day Lehman Brothers filed for bankruptcy.

A particularly interesting phenomenon to understand is the sudden liquidity freeze during the recent financial crisis: interbank lending beyond very short maturities virtually disappeared, three-month Libor (a key benchmark for interest rates on interbank loans) skyrocketed, and the world banking system ground to a halt. The proximate cause for this phase transition was the bankruptcy of Lehman Brothers (the fourth largest investment bank in the US) at 1:45 am on 15 September 2008, but the real culprit lay in the build-up of unchecked systemic risk (Ivashina & Scharfstein, 2010; Domanski & Turner, 2011; Gorton & Metrick, 2012). Since I am no economist, banker, or trader, the connections and simple mathematical models that Robert May has been advocating (e.g. May, Levin, & Sugihara, 2008) serve as my window into this foreign land. The idea of a good heuristic model is to cut away all non-essential features and try to capture the essence of the complicated phenomena needed for our insight. In this case, we need to keep around an idealized version of banks, their loan network, some external assets with which to trigger an initial failure, and a way to represent confidence. The question then becomes: under what conditions is the initial failure contained to one or a few banks, and when does it paralyze or (without intervention) destroy the whole financial system?

Mathematics in finance and hiding lies in complexity

Sir Andrew Wiles

Mathematics has a deep and rich history, extending well beyond the 16th-century start of the scientific revolution. Much like literature, mathematics has a timeless quality; although its trends wax and wane, no part of it becomes out-dated or wrong. What Diophantus of Alexandria wrote on solving algebraic equations in the 3rd century is as true today as it was in the 16th or 17th century. In fact, it was in 1637 in the margins of Diophantus’ Arithmetica that Pierre de Fermat scribbled the statement of his Last Theorem, along with the claim of a proof that the margin was too narrow to contain[1]. In modern notation it is probably one of the most famous Diophantine equations, a^n + b^n = c^n, with the assertion that it has no solutions for n > 2 and a, b, c positive integers. It is a statement that almost anybody can understand, but one that is far from easy to prove or even approach[2].

Randomness, necessity, and non-determinism

If we want to talk philosophy then it is necessary to mention Aristotle. Or is it just a likely beginning? For Aristotle, there were three types of events: certain, probable, and unknowable. Unfortunately for science, Aristotle considered the results of games of chance to be unknowable, and probability theory started — 18 centuries later — with the analysis of games of chance. This doomed much of science to an awkward fetishisation of probability, an undue love of certainty, and unreasonable quantification of the unknowable. A state of affairs that stems from our fear of admitting when we are ignorant, a strange condition given that many scientists would agree with Feynman’s assessment that one of the main features of science is acknowledging our ignorance:

Unfortunately, we throw away our ability to admit ignorance when we assign probabilities to everything, especially in settings where there is no reason to postulate an underlying stochastic generating process, or a way to do reliable repeated measurements. “Foul!” you cry, “Bayesians talk about beliefs, and we can hold beliefs about single events. You are just taking the dated frequentist stance.” Well, to avoid the nonsense of the frequentist vs. Bayesian debate, let me take literally the old adage “put your money where your mouth is” and use the fundamental theorem of asset pricing to define probability. I’ll show an example of a market we can’t price, and ask how theoretical computer science can resolve our problem with non-determinism.

Black swans and Orr-Gillespie theory of evolutionary adaptation

The internet loves fat tails; it is why awesome things like Wikipedia, Reddit, and countless kinds of StackExchanges exist. Finance, on the other hand, hates fat tails; it is why VaR and financial crises exist. A notable exception is Nassim Taleb, who became financially independent by hedging against the 1987 financial crisis, and made a multi-million dollar fortune on the recent crisis; to most he is known for his 2007 best-selling book The Black Swan. Taleb’s success has stemmed from his focus on highly unlikely events, or samples drawn from far on the tail of a distribution. When such rare samples have a large effect, we have a Black Swan event. These events are obviously important in finance, but Taleb also stresses their importance to the progress of science, and here I will sketch a connection to the progress of evolution.

Machine learning and prediction without understanding

Big data is the buzzword du jour, permeating from machine learning to Hadoop-powered distributed computing, from giant scientific projects to individual social science studies, and from careful statistics to the witchcraft of web-analytics. As we are overcome by petabytes of data and as more of it becomes public, it is tempting for a would-be theorist to simply run machine learning and big-data algorithms on these data sets and take the computer’s conclusions as understanding. I think this has the danger of overshadowing more traditional approaches to theory and the feedback between theory and experiment.

Individual versus systemic risk in asset allocation

Proponents of free markets often believe in an “invisible hand” that guides an economic system without external controls like government regulations. On this view, a highly efficient economic equilibrium can be created if all market participants act purely out of self-interest. In the paper titled “Individual versus systemic risk and the Regulator’s Dilemma”, Beale et al. (2011) applied agent-based simulations to show that a system of financial institutions attempting to minimize their own risk of failure may not minimize the risk of failure of the entire system. In addition, the authors suggested several ways to constrain the financial institutions in order to lower the risk of failure for the financial system. Their suggestion responds directly to the regulatory challenges during the recent financial crisis, where failures of some institutions endangered the financial system and even the global economy.

Finance regulation

It’s easy to get tangled up trying to regulate banks.

To illustrate the point of individual optimality versus system optimality, the paper makes simple assumptions about the financial system and its participants. In a world of N independent banks and M assets, each of the N banks seeks to invest its resources into these M assets from time 0 to time 1. The M returns on assets are assumed to be independently and identically distributed following a Student’s t-distribution with 1.5 degrees of freedom. If a bank’s loss exceeds a certain threshold, it fails. Under this assumption, each bank’s optimal allocation (to minimize its chance of failure) is to invest equally in each asset.
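To get a feel for why equal investment is individually optimal, here is a rough Monte Carlo sketch of a single bank’s failure probability under two allocations. It is not the paper’s code: the number of assets, the loss threshold of 3.0, the number of trials, and the function names are assumptions picked for illustration. Only the i.i.d. Student’s t returns with 1.5 degrees of freedom and the threshold failure rule come from the description above.

```python
import math
import random


def student_t(rng, df):
    """Draw one Student's t sample as Z / sqrt(V / df), with V chi-square(df).

    A chi-square(df) variable is Gamma(shape=df/2, scale=2).
    """
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(df / 2.0, 2.0)
    return z / math.sqrt(v / df)


def failure_probability(weights, threshold, df=1.5, trials=20000, seed=0):
    """Estimate P(portfolio return < -threshold) for one bank by simulation."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # Each asset's return is an independent t-distributed draw.
        portfolio_return = sum(w * student_t(rng, df) for w in weights)
        if portfolio_return < -threshold:
            failures += 1
    return failures / trials


M = 10  # number of assets (assumed)
equal = [1.0 / M] * M                   # invest equally in every asset
concentrated = [1.0] + [0.0] * (M - 1)  # invest everything in one asset

p_equal = failure_probability(equal, threshold=3.0)
p_concentrated = failure_probability(concentrated, threshold=3.0)
print(f"equal: {p_equal:.4f}, concentrated: {p_concentrated:.4f}")
```

In this sketch the equally weighted bank fails noticeably less often than the concentrated one, even though the heavy tails make diversification less effective than it would be for Gaussian returns.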

However, a regulator is concerned with the failure of the financial system rather than the failure of any individual bank. To incorporate this idea, the paper suggests a cost function for the regulator: c = k^s, where k is the number of failed banks. This cost function is the only coupling between banks in the model. If s > 1, this cost function implies that each additional bank failure “costs” the system more (one can tell by taking the derivative s \cdot k^{s-1}, which is an increasing function of k if s > 1). As s increases from 1, the systemically optimal allocation of all the banks deviates further from the individually optimal allocation. When s is 2, the systemically optimal allocation for each bank is to invest entirely in one asset, in drastic contrast to the individually optimal allocation (investing equally in each asset). In this situation, the safest investment allocation for the system leads to the riskiest investment allocation for the individual bank.
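The tension can be seen with a small back-of-the-envelope calculation. The sketch below compares the expected cost E[k^s] in two stylized extremes: every bank holding the same diversified portfolio (so all banks fail together), and every bank holding a different single asset (so failures are independent). Both extremes, the failure probabilities 0.03 and 0.06, and N = 5 are made-up numbers for illustration, not values from the paper.

```python
from math import comb


def cost_identical(n_banks, p_fail, s):
    """All banks hold the same portfolio: either nobody fails or everybody does."""
    return p_fail * n_banks ** s


def cost_independent(n_banks, p_fail, s):
    """Banks fail independently: k is Binomial(n_banks, p_fail), cost is E[k^s]."""
    return sum(
        comb(n_banks, k) * p_fail ** k * (1 - p_fail) ** (n_banks - k) * k ** s
        for k in range(n_banks + 1)
    )


N = 5
p_diversified = 0.03   # hypothetical failure probability of a diversified bank
p_single_asset = 0.06  # hypothetical failure probability of a one-asset bank

for s in (1, 2):
    same = cost_identical(N, p_diversified, s)
    niche = cost_independent(N, p_single_asset, s)
    print(f"s={s}: identical portfolios {same:.3f}, distinct assets {niche:.3f}")
```

With s = 1 the cost is just the expected number of failures, so the individually safer portfolio wins. With s = 2 simultaneous failures are penalized heavily, and in this example the riskier but uncorrelated banks give the lower expected cost.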

While the idea demonstrated above is interesting, the procedure is unnecessarily complex. A Student’s t-distribution with 1.5 degrees of freedom is far too broad an assumption for the distribution of returns on financial assets. At the same time, it lacks the simplicity of the Bernoulli or Gaussian distributions, which would allow analytical solutions (see Artem’s question on toy models of asset returns for more discussion). One simple example is a bond, whose principal and coupon payments are either paid to the bondholder in full or only partially paid in the event of a default. Bond returns are therefore not close to t-distributed. Other common assets such as mortgages and consumer loans are not t-distributed either. The t-distribution assumption thus does not come close to capturing the probabilistic nature of many major financial assets, and it does not provide any additional accuracy over the simpler Gaussian or Bernoulli assumptions. Gaussian or Bernoulli distributions, on the other hand, are at least capable of providing analytical solutions without the tedious simulations.

The authors define two parameters D and G in an attempt to constrain the banks to have systemically optimal allocations. D denotes the average distance between the asset allocations of each pair of banks. G denotes the distance between the average allocation across banks and the individually optimal allocation. As s increases from 1, the authors found that bank allocations with a higher D and a near-zero G are best for the system. To show the robustness of these two parameters, the authors varied other parameters such as the number of assets, the number of banks, the distributions of the assets, the correlation between the assets, and the form of the regulator’s cost function. They found lower systemic risk for the banking system when enforcing a near-zero G and a higher D. This result implies that the banks should each concentrate in their own niche of financial assets, but the aggregate system should still have the optimal asset allocation.
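As a rough illustration of the two parameters, the sketch below computes D and G for a list of bank allocations. The choice of Euclidean distance is an assumption (the summary above does not say which metric the paper uses), as are the function names and the three-bank examples.

```python
import math
from itertools import combinations


def diversity_D(allocations):
    """Average distance between the allocations of each pair of banks."""
    pairs = list(combinations(allocations, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def gap_G(allocations):
    """Distance between the average allocation and the equal-weight allocation."""
    n_banks = len(allocations)
    n_assets = len(allocations[0])
    mean_allocation = [
        sum(bank[j] for bank in allocations) / n_banks for j in range(n_assets)
    ]
    individual_optimum = [1.0 / n_assets] * n_assets
    return math.dist(mean_allocation, individual_optimum)


# Three banks that all hold the individually optimal portfolio.
herd = [[1 / 3, 1 / 3, 1 / 3]] * 3
# Three banks that each specialize in their own asset.
niches = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(diversity_D(herd), gap_G(herd))      # no diversity between banks
print(diversity_D(niches), gap_G(niches))  # high D, yet G stays near zero
```

The second configuration is the kind of arrangement the result points to: banks differ a lot from each other (high D) while the system as a whole still holds the equal-weight portfolio (G near zero).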

Based on the paper, it may appear that the systemic risk of failure can be reduced in the financial system by controlling the parameters D and G (though without analytical solutions). Such controls have to be enforced by an omnipotent “regulator” with perfect information on the exact probabilistic nature of the financial products and the individually optimal allocations. Moreover, this “regulator” must also have unlimited political power to enforce its envisioned allocations. This is far from reality. Financial products differ in size and riskiness, and new financial products are continuously being created. Regulators such as the Department of the Treasury, the SEC, and the Federal Reserve also have very limited political power. For example, these regulators were not legally allowed to rescue Lehman Brothers, whose failure led to the subsequent global credit and economic crisis.

The entire paper can be boiled down to one simple idea: optimal actions for the individuals might not be optimal for the system, but if there is an all-powerful regulator who forces the individuals to act optimally for the system, the system will be more stable. This should come as no surprise to regular readers of this blog, since evolutionary game theory deals with this exact dilemma when looking for cooperation. This main result is rather trivial, but it opens ideas for more realistic simulations. One idea would be to remove or weaken the element of the “regulator” and add incentives for banks to act in a more systemically optimal way. It would be interesting to look at how banks act under these circumstances and whether or not their actions can lead to a systemically optimal equilibrium.

The key aspect of a systemic failure is not the simultaneous failure of many assets. Two aspects of bank operations matter more. First, banks operate by taking funding from clients to invest in riskier assets. This operation requires strong confidence in the bank’s strength to avoid unexpected withdrawals, or the ability to sell these assets to pay back the clients. Second, banks obtain short-term loans from each other by putting up assets as collateral. This connects the banks more strongly than a simple regulator cost function, creating a banking ecosystem.

The strength of the inter-bank connections depends on the value of the collateral. In the event of catastrophic losses on subprime loans by some banks, confidence in these banks is shaken and the value of their assets starts to come down. A bank’s clients may start to withdraw their money and the bank sells its assets to meet its clients’ demands, further depressing the prices of the assets and sometimes leading to a fire sale. Other banks would start asking for more and higher-quality collateral due to the depressed prices from the sell-off. The bank’s high-quality assets and cash may subsequently become strained, leading to further worry about the bank’s health and more client withdrawals and collateral demands. Lack of confidence in one bank’s survival leads to worries about the other banks that have lent to that bank, triggering a fresh wave of withdrawals and collateral demands. Even healthy banks can be ruined in a matter of days by a widespread panic.

As a result, the inter-bank dealings are instrumental in the event of systemic failure. Beale et al. (2011) intentionally sidestepped this inter-bank link to arrive at their result purely from a perspective of asset failure. But the inter-bank link was the most important factor in creating mass failure of the financial system. It is because of this link that the failure of one asset (subprime mortgages) managed to nearly bring down the financial systems of the entire developed world. Not addressing the inter-bank link is simply not addressing the financial crisis at all.

Beale, N., Rand, D. G., Battey, H., Croxson, K., May, R. M., & Nowak, M. A. (2011). Individual versus systemic risk and the Regulator’s Dilemma. Proceedings of the National Academy of Sciences, 108(31), 12647-12652.

Mathematical models in finance and ecology

Theoretical physicists have the reputation of an invasive species — penetrating into other fields and forcing their methods. Usually these efforts simply irritate the local researchers, building a general ambivalence towards field-hopping physicists. With my undergraduate training primarily in computer science and physics, I’ve experienced this skepticism first hand. During my time in Waterloo, I tried to supplement my quantum computing work by engaging with ecologists. My advances were met with a very dismissive response:

But at the risk of sounding curmudgeonly, it is my experience that many folks working in physics and comp sci are more or less uninformed regarding the theoretical ecology, and tend to reinvent the wheel.

On rare occasion though, a theorist will move into a field of sledges & rollers, and help introduce the first wheel. This was the case 40 years before my ill-fated courtship of Waterloo ecologists, when Robert May published “Stability in multispecies community models” (1971) and transitioned from theoretical physics (PhD 1959, University of Sydney) to ecology. He helped transform the field from shunning equations to a vibrant community of observation, experiments, and mathematical models.

Lord Robert May of Oxford. Photo is from the donor’s page of Sydney High School Old Boys Union where he attended secondary school.

Robert M. May, Lord May of Oxford, is a professor in the Department of Zoology at the University of Oxford. I usually associate him with two accomplishments inspired by (but independent of) ecology. First, he explored the logistic map x_{t + 1} = r x_t(1 - x_t) and its chaotic behavior (May, 1976), becoming one of the co-founders of modern chaos theory. Although the origins of chaos theory can be traced back to another great cross-disciplinary scholar, Henri Poincaré, it wasn’t until the efforts of May and colleagues in the 1970s that the field gained significant traction outside of mathematics and gripped the popular psyche. Second, he worked with his post-doc Martin A. Nowak to popularize the spatial Prisoner’s Dilemma and computer simulation as an approach to the evolution of cooperation (Nowak & May, 1992). This launched the sub-field that I find myself most comfortable in and stressed the importance of spatial structure in EGT. May is now pivoting yet again: he is harnessing his knowledge of ecology and epidemiology to study the financial ecosystem (May, Levin, & Sugihara, 2008).
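The logistic map is simple enough to play with directly. The sketch below iterates x_{t + 1} = r x_t(1 - x_t) for two values of r; the particular values r = 2.5 and r = 4, the starting points, and the function name `logistic_orbit` are choices made for this illustration.

```python
def logistic_orbit(r, x0, steps):
    """Iterate the logistic map x -> r * x * (1 - x) and return the whole orbit."""
    orbit = [x0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit


# For r = 2.5 the orbit settles on the stable fixed point 1 - 1/r = 0.6.
settled = logistic_orbit(2.5, 0.2, 200)[-1]

# For r = 4 two orbits that start 1e-10 apart soon look unrelated.
a = logistic_orbit(4.0, 0.2, 100)
b = logistic_orbit(4.0, 0.2 + 1e-10, 100)
max_gap = max(abs(x - y) for x, y in zip(a, b))

print(f"r=2.5 settles near {settled:.6f}")
print(f"r=4 largest gap between the two orbits: {max_gap:.3f}")
```

The second experiment is the sensitive dependence on initial conditions that makes long-term prediction hopeless even though the rule itself is a one-line deterministic formula.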

After the 2008 crisis, finance became a hot topic for academics, and May, Levin, & Sugihara (2008) suggested mathematical ecology as a source of inspiration. Questions of systemic risk, or failure of the whole banking system (as opposed to a single constituent bank), grabbed researchers’ attention. In many ways, these questions were analogous to the failure of ecosystems, and fisheries research had a history similar to that of finance. Early research on fisheries would fixate on single species, the equivalent of a bank worrying only about its own risk-management strategy. However, the fish were intertwined in an ecological network, just as banks are connected through an inter-bank loan network. The external stresses that fish species experience were not independent: something like a change in local currents or temperature would affect many species at once. Analogously, the devaluation of an external asset class like the housing market affects many banks at once. As over-consumption depleted fisheries in spite of ecologists’ predictions, the researchers realized that they must switch to a holistic view; they turned their attention to the whole ecological network and examined how the structure of species’ interactions could aid or hamper the survival of the ecosystem. Regulators have to view systemic risk in financial systems through the same lens by considering a holistic approach to managing risk.

Once a shock is underway, ideas from epidemiology can help to contain it. As one individual becomes sick, he risks passing on that illness to his social contacts. In finance, if a bank fails then the loans it defaulted on can cause its lenders to fail, and the failure can propagate through the inter-bank loan network. Unlike the designer of an engineered network such as an electrical grid, an epidemiologist does not have control over how humans interact with each other; she can’t design our social network. Instead, she has to deter the spread of disease through selective immunization or through encouraging behavior that individuals in the population might or might not adopt. Similarly, central bankers cannot simply tell all other banks whom to lend to; instead they must target specific banks for intervention (say through bail-out) or implement policies that individual banks might or might not follow (depending on how these align with their interests). The financial regulator can view bank failure as a contagion (Gai & Kapadia, 2010) and adapt ideas from public health.
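The contagion picture can be made concrete with a toy default cascade. This is only a sketch in the spirit of the description above, not the model of Gai & Kapadia (2010): the rule that a lender loses its whole exposure to a failed borrower and fails once its losses exceed its capital, the four-bank network, and all the numbers are assumptions for illustration.

```python
from collections import deque


def cascade(exposures, capital, initial_failures):
    """Propagate defaults through a loan network.

    exposures[lender][borrower] is the amount the lender loses if the borrower
    fails; a lender fails once its accumulated losses exceed its capital.
    """
    losses = {bank: 0.0 for bank in capital}
    failed = set(initial_failures)
    queue = deque(initial_failures)
    while queue:
        defaulter = queue.popleft()
        for lender, loans in exposures.items():
            if lender in failed:
                continue
            losses[lender] += loans.get(defaulter, 0.0)
            if losses[lender] > capital[lender]:
                failed.add(lender)
                queue.append(lender)
    return failed


# Hypothetical network: A lends to B, B lends to C, D lends a little to C.
exposures = {"A": {"B": 5.0}, "B": {"C": 5.0}, "C": {}, "D": {"C": 1.0}}
capital = {"A": 3.0, "B": 4.0, "C": 2.0, "D": 3.0}

print(sorted(cascade(exposures, capital, ["C"])))                    # ['A', 'B', 'C']
print(sorted(cascade(exposures, {**capital, "B": 6.0}, ["C"])))      # ['C']
```

The second call plays the role of a targeted intervention: strengthening the single bank B is enough to stop the failure of C from reaching A, which is the financial analogue of selective immunization.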

The best part of mathematical models is that the preceding commonalities are not restricted to analogy and metaphor. May and colleagues make these connections precise by building analytic models for toy financial systems and then using their experience and tools from theoretical ecology to solve these models. Further, the cross-fertilization is not one-sided. In exchange for mathematical tools, finance provides ecology with a wealth of data. Studies like the one commissioned by the Federal Reserve Bank of New York (Soramäki et al., 2007) can look at the interaction of 9,500 banks with a total of 700,000 transfers to reveal the topology of inter-bank payment flows. Ecologists can only dream of such detailed data on which to test their theories. For entertainment and more information, watch Robert May’s hour-long snarky presentation of his work with Arinaminpathy, Haldane, and Kapadia (May & Arinaminpathy, 2010; Haldane & May, 2011; Arinaminpathy, Kapadia, & May, 2012) during the 2012 Stanislaw Ulam Memorial Lectures at the Santa Fe Institute:


Arinaminpathy, N., Kapadia, S., & May, R. M. (2012). Size and complexity in model financial systems. Proceedings of the National Academy of Sciences, 109(45), 18338-18343.

Gai, P., & Kapadia, S. (2010). Contagion in financial networks. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Science, 466(2120), 2401-2423.

Haldane, A. G., & May, R. M. (2011). Systemic risk in banking ecosystems. Nature, 469(7330), 351-355.

May, R. M. (1971). Stability in multispecies community models. Mathematical Biosciences, 12(1), 59-79.

May, R. M. (1976). Simple mathematical models with very complicated dynamics. Nature, 261(5560), 459-467.

May, R. M., Levin, S. A., & Sugihara, G. (2008). Ecology for bankers. Nature, 451(7181), 893-895. PMID: 18288170

May, R. M., & Arinaminpathy, N. (2010). Systemic risk: the dynamics of model banking systems. Journal of the Royal Society Interface, 7(46), 823-838.

Nowak, M. A., & May, R. M. (1992). Evolutionary games and spatial chaos. Nature, 359(6398), 826-829.

Soramäki, K., Bech, M. L., Arnold, J., Glass, R. J., & Beyeler, W. E. (2007). The topology of interbank payment flows. Physica A: Statistical Mechanics and its Applications, 379(1), 317-333.