Cancer, bad luck, and a pair of paradoxes
April 4, 2015
Among the highlights of my recent visit to IMO were several stimulating discussions with Artem Kaznatcheev. I’m still thinking over my response to his recent post about reductionist versus operationalist approaches in math biology, which is very relevant to some of my current research. Meanwhile, at Artem’s suggestion, this post will discuss a reanalysis of the “cancer and bad luck” paper that spurred so many headlines at the start of this year. Whereas many others have written critiques of that paper’s statistical methods and interpretations, my colleagues and I instead tried fitting alternative models to the underlying data. We thus found ourselves revisiting a couple of celebrated scientific paradoxes.
To start this post, I will introduce you to Simpson’s paradox and Peto’s paradox. With this pair of paradoxes in mind, we’ll turn a critical eye to Tomasetti & Vogelstein (2015), and I will explain our reanalysis of their data set.
Edward H Simpson has several claims to fame. He began his statistical career as a wartime code breaker at Bletchley Park, applying Bayesian methods to speed up the deciphering of German and Japanese ciphers (for the history, see Simpson, 2010). In ecology, he is noted for introducing a widely used measure of diversity (Simpson, 1949). After the war he became a distinguished civil servant. But perhaps Simpson’s best-known achievement is a paper (Simpson, 1951) he wrote as a graduate student, characterising a puzzling quirk of statistics that now bears his name.
To understand the phenomenon, consider the relationship between the numbers of heads (H) and tails (T) observed in experiments, where each experiment comprises N = 100, 200, or 300 coin tosses. Across the whole data set, we’d expect to see a positive correlation between H and T because the expected values of each variable scale with N (as illustrated in the figure at right). However, for each distinct value of N we have H = N – T. Therefore if we split the data into three subsets according to sample size (N = 100, N = 200, and N = 300) then H and T will be negatively correlated within each subset (the differently coloured groups in the figure at right). This is Simpson’s paradox: an observed trend in data can change and even reverse when the data are split into subsets, leading to all sorts of counterintuitive results.
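The coin-toss example is easy to check by simulation. The sketch below is only an illustration: the number of experiments per sample size, the random seed, and the helper names `pearson` and `simulate` are my own assumptions, not part of any published analysis. It tosses fair coins in batches of N = 100, 200, and 300, then compares the correlation between H and T across the pooled data with the correlation inside each batch size.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(sizes=(100, 200, 300), experiments_per_size=200, seed=1):
    """Return {N: [(heads, tails), ...]} for fair-coin experiments.

    experiments_per_size and seed are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    data = {}
    for n in sizes:
        pairs = []
        for _ in range(experiments_per_size):
            heads = sum(rng.random() < 0.5 for _ in range(n))
            pairs.append((heads, n - heads))  # T = N - H by construction
        data[n] = pairs
    return data


data = simulate()

# Correlation across the whole data set (all sample sizes pooled).
pooled = [pair for pairs in data.values() for pair in pairs]
pooled_r = pearson([h for h, _ in pooled], [t for _, t in pooled])

# Correlation within each subset defined by sample size.
within_r = {
    n: pearson([h for h, _ in pairs], [t for _, t in pairs])
    for n, pairs in data.items()
}

print(f"pooled correlation: {pooled_r:.3f}")
for n, r in within_r.items():
    print(f"N = {n}: correlation {r:.3f}")
```

The pooled correlation comes out strongly positive, whereas within each sample size the correlation is −1 (up to floating-point rounding), because T is fully determined by H once N is fixed.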
Another English statistician, Sir Richard Peto, gave his name to our second paradox, which concerns cancer incidence across species (something that Artem has touched on before; for the original papers, see Peto et al., 1975; Peto, 1977). It is thought that cancer most commonly arises due to DNA modifications during cell division (specifically, stem cell division). Therefore we might expect large animals to have many more tumours than small ones, simply due to the vast difference in total number of cells and therefore in cell divisions per body. Scaling up from human cancer rates, we’d expect blue whales to be overwhelmed by cancer. Peto’s paradox is the observation that, at the species level, there appears to be no such correlation between the incidence of cancer and the number of organismal cells. Evolutionary theory provides a possible solution to this paradox, because selection for cancer-suppressing traits should be stronger in larger organisms.
Thus primed in paradoxes, let’s return to the work of Tomasetti & Vogelstein (2015). The idea behind their study is appealingly simple. As already mentioned, it’s thought that cancer usually starts during stem cell division. Consistent with this theory, Tomasetti & Vogelstein (2015) found a correlation between cancer incidence per tissue and the lifetime number of stem cell divisions within the tissue. For example, stem cells in the brain seldom divide and brain tumours are generally rare, whereas the colon has a high turnover of stem cells and is more commonly afflicted by cancer.
Two observations inspired me, my boss Michael Hochberg, and our colleague Oliver Kaltz to reanalyse this data set. First, if every stem cell division carries the same risk of seeding cancer then the slope of the correlation should be 1. In other words, doubling the number of divisions should double the risk of cancer. However, the correlation identified by Tomasetti & Vogelstein (2015) has a slope of only ~0.5 (on the log-log scale). Second, the data set is not representative of all cancer types: breast and prostate cancers are omitted (due to lack of reliable measurements), whereas other cancers such as osteosarcoma (bone cancer) are seemingly overrepresented. Such sample bias could skew the analysis.
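To see why a slope of 1 is the natural expectation, here is a back-of-envelope derivation. The symbols are my own shorthand and the independence assumption is a simplification: suppose each of d lifetime stem cell divisions independently seeds cancer with the same small probability p, and let R be the lifetime risk.

```latex
\begin{align}
R &= 1 - (1 - p)^{d} \approx p\,d
  && \text{(valid when } p\,d \ll 1\text{)} \\
\log R &\approx \log p + 1 \cdot \log d
  && \text{(slope 1 on the log-log scale)} \\
R &\propto d^{\,0.5}
  \;\Longrightarrow\; \frac{R(2d)}{R(d)} = 2^{0.5} \approx 1.41
  && \text{(what a slope of 0.5 would imply instead)}
\end{align}
```

Under these assumptions, doubling the number of divisions doubles the risk, whereas a slope of 0.5 means that doubling the divisions raises the risk by only about 41%.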
Our response to these issues was to change the statistical model. Instead of fitting a regression model to all cancer types together, we subsetted the data by anatomical site (bone, thyroid, pancreas, etc.). Thus we divided the variation in cancer risk into two parts: between-group variation associated with anatomical site, and within-group variation associated with number of stem cell divisions (see figure at right). Our hypothesis was that the correlation within groups might differ from the correlation across the whole data set. In other words, we predicted that the location of a cancer within the body might be one of the “lurking explanatory variables” that underlie Simpson’s paradox.
As suspected, the subsets revealed a very different pattern compared to the combined data (see figure at right). Within each of the groups defined by anatomical site, the slope of the correlation between cancer risk and lifetime number of stem cell divisions is approximately 1, exactly as predicted by Tomasetti & Vogelstein’s biological hypothesis. However, there is also large variation between subsets, which means that the cancer risk per stem cell division varies enormously depending on anatomical site. For example, our results suggest that a stem cell division is ~10,000 times more likely to lead to cancer if it occurs in bone or in the thyroid than in the small intestine.
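The gap between pooled and within-group slopes can be reproduced with a toy calculation. The numbers below are entirely synthetic, chosen only to make the arithmetic transparent. They are not the Tomasetti & Vogelstein data, and the site labels, intercepts, and the helper `ols_slope` are hypothetical. Each made-up site follows log10(risk) = log10(p_site) + log10(divisions), so the slope within each site is 1, but sites with more divisions are given a lower per-division risk p_site.

```python
def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic sites: (log10 of lifetime divisions, log10 of per-division risk).
# All values are invented for illustration only.
sites = {
    "site A": ([6.0, 7.0, 8.0], -10.0),
    "site B": ([8.0, 9.0, 10.0], -12.0),
    "site C": ([10.0, 11.0, 12.0], -14.0),
}

pooled_x, pooled_y = [], []
within_slopes = {}
for name, (log_divisions, log_p) in sites.items():
    # Within a site, risk is proportional to divisions (slope 1 by design).
    log_risk = [log_p + x for x in log_divisions]
    within_slopes[name] = ols_slope(log_divisions, log_risk)
    pooled_x.extend(log_divisions)
    pooled_y.extend(log_risk)

# Ignoring site membership gives a much shallower slope.
pooled_slope = ols_slope(pooled_x, pooled_y)

for name, slope in within_slopes.items():
    print(f"{name}: within-group slope {slope:.2f}")
print(f"pooled slope: {pooled_slope:.2f}")
```

In this toy data set every within-site slope is 1 while the pooled slope is much shallower, because the between-site differences in per-division risk (here spanning four orders of magnitude) flatten the overall trend. Fitting real data would normally use a mixed-effects or analysis-of-covariance model rather than separate hand-rolled regressions.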
And here is where we find the connection to the second paradox. Just as in differently sized species, we suggest that anatomical sites with high numbers of stem cell divisions have evolved more powerful anti-cancer mechanisms. The lining of the colon is constantly renewing itself through stem cell divisions, so we would expect the colon to be especially good at lessening the risk of tumourigenesis per division. Conversely, bone stem cell divisions are rare, so we would not expect bones to invest so heavily in cancer prevention. Although the risk of each cancer type correlates with the lifetime number of stem cell divisions within each anatomical site, there is no such correlation between anatomical sites. Instead, the risks of the most common cancers of each anatomical site appear to saturate at around 1%. Similarly, there may be a correlation between cancer risk and body size within species (e.g. dogs and humans) but not between species. It seems that Peto’s paradox applies to anatomical sites just as well as it applies to species.
In summary, by applying an only slightly more complex statistical model, and by viewing our results in the light of evolutionary theory, we have obtained quite different results from Tomasetti & Vogelstein’s data, adding to the understanding of carcinogenesis and the evolution of cancer control. You can find out more by downloading our arXiv preprint (Noble et al., 2015). Comments are especially welcome now as we’re preparing to submit this work to a journal soon.
[Editor’s note: this work has now appeared as: Noble, R., Kaltz, O., & Hochberg, M. E. (2015). Peto’s paradox and human cancers. Phil. Trans. R. Soc. B, 370(1673): 20150104.]
Noble, R., Kaltz, O., & Hochberg, M.E. (2015). Statistical interpretations and new findings on Variation in Cancer Risk Among Tissues. arXiv preprint: 1502.01061.
Peto, R., Roe, F.J.C., Lee, P.N., Levy, L., & Clack, J. (1975). Cancer and ageing in mice and men. British Journal of Cancer, 32(4): 411–426.
Peto, R. (1977). Epidemiology, multistage models, and short-term mutagenicity tests. In: Hiatt HH, Watson JD, Winsten JA, editors. The Origins of Human Cancer. NY: Cold Spring Harbor Conferences on Cell Proliferation, 4, Cold Spring Harbor Laboratory. pp. 1403–1428.
Simpson, E.H. (1949). Measurement of diversity. Nature.
Simpson, E.H. (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society. Series B (Methodological), 238–241.
Simpson, E.H. (2010). Edward Simpson: Bayes at Bletchley Park. Significance, 7(2): 76–80.
Tomasetti, C., & Vogelstein, B. (2015). Variation in cancer risk among tissues can be explained by the number of stem cell divisions. Science, 347(6217): 78–81.