# Cancer, bad luck, and a pair of paradoxes

Among the highlights of my recent visit to IMO were several stimulating discussions with Artem Kaznatcheev. I’m still thinking over my response to his recent post about reductionist versus operationalist approaches in math biology, which is very relevant to some of my current research. Meanwhile, at Artem’s suggestion, this post will discuss a reanalysis of the “cancer and bad luck” paper that spurred so many headlines at the start of this year. Whereas many others have written critiques of that paper’s statistical methods and interpretations, my colleagues and I instead tried fitting alternative models to the underlying data. We thus found ourselves revisiting a couple of celebrated scientific paradoxes.

To start this post, I will introduce you to Simpson’s paradox and Peto’s paradox. With this pair of paradoxes in mind, we’ll turn a critical eye to Tomasetti & Vogelstein (2015), and I will explain our reanalysis of their data set.

Edward H Simpson has several claims to fame. He began his statistical career as a wartime code breaker at Bletchley Park, applying Bayesian methods to speed up the deciphering of German and Japanese ciphers (for the history, see Simpson, 2010). In ecology, he is noted for introducing a widely-used measure of diversity (Simpson, 1949). After the war he became a distinguished civil servant. But perhaps Simpson’s best known achievement is a paper (Simpson, 1951) he wrote as a graduate student, characterising a puzzling quirk of statistics that now bears his name.

To understand the phenomenon, consider the relationship between the numbers of heads (H) and tails (T) observed in experiments, where each experiment comprises N = 100, 200, or 300 coin tosses. Across the whole data set, we’d expect to see a positive correlation between H and T because the expected value of each variable scales with N (as illustrated in the figure at right). However, for each distinct value of N we have H = N – T. Therefore if we split the data into three subsets according to sample size (N = 100, N = 200, and N = 300) then H and T will be negatively correlated within each subset (the differently coloured groups in the figure at right). This is Simpson’s paradox: an observed trend in data can change and even reverse when the data are split into subsets, leading to all sorts of counterintuitive results.
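The coin-toss example is easy to reproduce. The following is a minimal sketch, assuming fair coins and 200 experiments at each sample size; the seed, the number of experiments, and the variable names are illustrative choices, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# 200 experiments at each sample size N (an arbitrary illustrative choice).
sizes = np.repeat([100, 200, 300], 200)
heads = rng.binomial(sizes, 0.5)   # H: heads observed in each experiment
tails = sizes - heads              # T = N - H, i.e. H = N - T

# Pooled over all experiments, H and T both grow with N.
pooled_r = np.corrcoef(heads, tails)[0, 1]

# Within a fixed N, every extra head means one fewer tail.
within_r = {
    n: np.corrcoef(heads[sizes == n], tails[sizes == n])[0, 1]
    for n in (100, 200, 300)
}

print(f"pooled correlation: {pooled_r:.2f}")
for n, r in within_r.items():
    print(f"N = {n}: correlation {r:.2f}")
```

The pooled correlation comes out strongly positive, while the correlation within each value of N is negative (here perfectly so, because T is fully determined by H once N is fixed).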

Another English statistician, Sir Richard Peto, gave his name to our second paradox, which concerns cancer incidence across species (something that Artem has touched on before; for original paper, see Peto et al., 1975; Peto, 1977). It is thought that cancer most commonly arises due to DNA modifications during cell division (specifically, stem cell division). Therefore we might expect large animals to have many more tumours than small ones, simply due to the vast difference in total number of cells and therefore in cell divisions per body. Scaling up from human cancer rates, we’d expect blue whales to be overwhelmed by cancer. Peto’s paradox is the observation that, at the species level, there appears to be no such correlation between the incidence of cancer and the number of organismal cells. Evolutionary theory provides a possible solution to this paradox, because selection for cancer-suppressing traits should be stronger in larger organisms.

Thus primed in paradoxes, let’s return to the work of Tomasetti & Vogelstein (2015). The idea behind their study is appealingly simple. As already mentioned, it’s thought that cancer usually starts during stem cell division. Consistent with this theory, Tomasetti & Vogelstein (2015) found a correlation between cancer incidence per tissue and the lifetime number of stem cell divisions within the tissue. For example, stem cells in the brain seldom divide and brain tumours are generally rare, whereas the colon has a high turnover of stem cells and is more commonly afflicted by cancer.

Two observations inspired me, my boss Michael Hochberg, and our colleague Oliver Kaltz to reanalyse this data set. First, if every stem cell division carries the same risk of seeding cancer then the slope of the correlation should be 1. In other words, doubling the number of divisions should double the risk of cancer. However, the correlation identified by Tomasetti & Vogelstein (2015) has a slope of only ~0.5 (on the log-log scale). Second, the data set is not representative of all cancer types: breast and prostate cancers are omitted (due to lack of reliable measurements), whereas other cancers such as osteosarcoma (bone cancer) are seemingly overrepresented. Such sample bias could skew the analysis.
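Why a slope of 1? Here is a back-of-envelope sketch. It rests on an assumption made only for illustration: each of the d lifetime stem cell divisions independently seeds cancer with the same small probability p. The symbols R (lifetime risk), p, d, and b (the log-log slope) are introduced here and do not come from the paper.

```latex
\begin{align}
R &= 1 - (1 - p)^{d} \approx p\,d \qquad (p\,d \ll 1), \\
\log R &\approx \log p + 1 \cdot \log d
  && \text{(slope 1 on the log-log scale)}, \\
R &\propto d^{\,b} \;\Rightarrow\; \frac{R(2d)}{R(d)} = 2^{b}:
  \qquad b = 1 \Rightarrow 2, \qquad b = 0.5 \Rightarrow \sqrt{2} \approx 1.41 .
\end{align}
```

So with a slope of about 0.5, doubling the number of divisions raises the risk by only about 41% rather than doubling it.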

Cancer risk versus lifetime number of stem cell divisions (lscd) on a log-log scale. Each colour corresponds to a different anatomical site, and the coloured lines are the best fits of our model. The length of each coloured line represents the variance in risk due to the lifetime number of stem cell divisions for that anatomical region or tissue type, and the spacing between coloured lines represents the variation due to tissue type. Based on figures 5A and 6A of Noble et al. (2015).

Our response to these issues was to change the statistical model. Instead of fitting a regression model to all cancer types together, we subsetted the data by anatomical site (bone, thyroid, pancreas, etc.). Thus we divided the variation in cancer risk into two parts: between-group variation associated with anatomical site, and within-group variation associated with number of stem cell divisions (see figure at right). Our hypothesis was that the correlation within groups might differ from the correlation across the whole data set. In other words, we predicted that the location of a cancer within the body might be one of the “lurking explanatory variables” that underlie Simpson’s paradox.
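In symbols, the change of model might be sketched as follows. The notation is introduced here for illustration: R_i is the lifetime risk of cancer type i, d_i its lifetime number of stem cell divisions, and s(i) its anatomical site. A common within-site slope with site-specific intercepts is assumed; the model in the paper may differ in detail.

```latex
\begin{align}
\text{pooled:}  \quad & \log_{10} R_i = a + b \,\log_{10} d_i + \varepsilon_i \\
\text{by site:} \quad & \log_{10} R_i = a_{s(i)} + b_{\mathrm{w}} \,\log_{10} d_i + \varepsilon_i
\end{align}
```

Differences among the intercepts a_s carry the between-group variation, and the slope b_w carries the within-group variation. Simpson’s paradox corresponds to the case where b and b_w differ.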

As suspected, the subsets revealed a very different pattern compared to the combined data (see figure at right). Within each of the groups defined by anatomical site, the slope of the correlation between cancer risk and lifetime number of stem cell divisions is approximately 1, exactly as predicted by Tomasetti & Vogelstein’s biological hypothesis. However, there is also large variation between subsets, which means that the cancer risk per stem cell division varies enormously depending on anatomical site. For example, our results suggest that a stem cell division is ~10,000 times more likely to lead to cancer if it occurs in bone or in the thyroid than in the small intestine.
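To see how a shallow pooled slope and within-site slopes near 1 can coexist, here is a small simulation on synthetic data. Nothing below uses the Tomasetti & Vogelstein data set: the three sites, their centres and intercepts, the noise level, and the variable names are all invented for illustration, and the common-slope fit is one simple way to set up such a model, not necessarily the one used in our paper.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Hypothetical sites. Sites with more divisions get a lower risk per division,
# so that risk at each site's centre is the same (10**-3) by construction.
site_centres = np.array([7.5, 9.0, 10.5])        # mean log10(lscd) per site
site_intercepts = np.array([-10.5, -12.0, -13.5])  # log10(risk per division)

n_per_site = 8
site = np.repeat(np.arange(3), n_per_site)
log_lscd = site_centres[site] + rng.uniform(-2.0, 2.0, size=site.size)
# True within-site slope is 1: log10(risk) = intercept(site) + log10(lscd) + noise.
log_risk = site_intercepts[site] + log_lscd + rng.normal(0.0, 0.1, size=site.size)

# Model A: a single regression line through all points, ignoring site.
pooled_slope = np.polyfit(log_lscd, log_risk, 1)[0]

# Model B: common slope plus a separate intercept for each site.
# Design matrix columns: log_lscd, then one indicator column per site.
design = np.column_stack([log_lscd, np.eye(3)[site]])
coef, *_ = np.linalg.lstsq(design, log_risk, rcond=None)
within_slope, fitted_intercepts = coef[0], coef[1:]

print(f"pooled slope:      {pooled_slope:.2f}")
print(f"within-site slope: {within_slope:.2f}")
print("fitted intercepts:", np.round(fitted_intercepts, 2))
```

With these made-up settings the pooled slope falls well below 1 while the within-site slope stays close to 1. On this scale, a gap of 4 between two fitted intercepts would correspond to a 10,000-fold difference in risk per division.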

And here is where we find the connection to the second paradox. Just as in differently sized species, we suggest that anatomical sites with high numbers of stem cell divisions have evolved more powerful anti-cancer mechanisms. The lining of the colon is constantly renewing itself through stem cell divisions, so we would expect the colon to be especially good at lessening the risk of tumourigenesis per division. Conversely, bone stem cell divisions are rare, so we would not expect bones to invest so heavily in cancer prevention. Although the risk of each cancer type correlates with the lifetime number of stem cell divisions within each anatomical site, there is no such correlation between anatomical sites. Instead, the risks of the most common cancers of each anatomical site appear to saturate at around 1%. Similarly, there may be a correlation between cancer risk and body size within species (e.g. dogs and humans) but not between species. It seems that Peto’s paradox applies to anatomical sites just as well as it applies to species.

In summary, by applying an only slightly more complex statistical model, and by viewing our results in the light of evolutionary theory, we have obtained quite different results from Tomasetti & Vogelstein’s data, adding to the understanding of carcinogenesis and the evolution of cancer control. You can find out more by downloading our arXiv preprint (Noble et al., 2015). Comments are especially welcome now as we’re preparing to submit this work to a journal soon.

[Editor’s note: this work has now appeared as: Noble, R., Kaltz, O., & Hochberg, M. E. (2015). Peto’s paradox and human cancers. Phil. Trans. R. Soc. B, 370(1673): 20150104.]

### References

Noble, R., Kaltz, O., & Hochberg, M.E. (2015). Statistical interpretations and new findings on Variation in Cancer Risk Among Tissues. arXiv preprint: 1502.01061.

Peto, R., Roe, F.J.C., Lee, P.N., Levy, L., & Clack, J. (1975). Cancer and ageing in mice and men. British Journal of Cancer, 32(4): 411–426.

Peto, R. (1977). Epidemiology, multistage models, and short-term mutagenicity tests. In: Hiatt HH, Watson JD, Winsten JA, editors. The Origins of Human Cancer. NY: Cold Spring Harbor Conferences on Cell Proliferation, 4, Cold Spring Harbor Laboratory. pp. 1403–1428.

Simpson, E.H. (1949). Measurement of diversity. Nature.

Simpson, E.H. (1951). The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society. Series B (Methodological), 238-241.

Simpson, E.H. (2010). Edward Simpson: Bayes at Bletchley Park. Significance, 7(2): 76-80.

Tomasetti, C., & Vogelstein, B. (2015). Variation in cancer risk among tissues can be explained by the number of stem cell divisions. Science, 347(6217): 78-81.

About Rob Noble
I am a lecturer in applied mathematics at City, University of London. I use mathematical and computational models to investigate evolutionary and ecological systems. I am currently working, in close collaboration with laboratory scientists, on models of cancer evolution and the development of drug resistance. My methods include game theory, analysis of dynamical systems, spatially structured models, and Bayesian inference. During my PhD at the University of Oxford (2009-2013) I used mathematical models, informed by statistical analysis of laboratory data, to understand the immune evasion mechanisms of the malaria parasite Plasmodium falciparum.

### 22 Responses to Cancer, bad luck, and a pair of paradoxes

1. Steve says:

Very nice post! It seems that you’ve done the work that T&V should have done in the first place.

2. Steve says:

One more comment. The remark about the adaptive benefits of lower cancer risk per stem cell division in the colon vs. bone makes eminent sense to me. Yet I have a nagging suspicion that invoking natural selection in this way might not be justified. Since cancer is highly correlated with aging, it could be that getting cancer is adaptive and in fact nature evolved a special mechanism to make sure that bone will eventually get cancer despite its lower cell division count. In other words, the evolutionary logic could be entirely reversed. Colon is what it is and evolved with no care about cancer, which kicks in with aging. On the other hand, nature put a lot of effort into making sure that bone would eventually get cancer and old people would die. So instead of the evolution of anti-cancer mechanisms, we may have had the evolution of pro-cancer mechanisms. I am not saying I believe any of this. I am only wondering how it can be ruled out and, if it cannot, why is it useful to bring in natural selection at the level of the species (as opposed to the cellular level). This is not a criticism: it’s an honest question that only reflects my ignorance.

3. David Colquhoun says:

This is a fascinating analysis. I suggest that the paper (and the blog post) should contain a clear, and (as far as possible) non-technical statement of the extent to which the cancer can be regarded as a result of chance, “luck” (excluding, of course, lung cancer in cigarette smokers). It is this aspect of T&V which caused most comment, and not infrequently, outrage. If it isn’t dealt with explicitly in the paper, you’ll probably have to write another ‘clarification’.

4. David, in reply to your comments, our analysis (Noble et al. 2015) indicates that incidence can be predicted for the cancers covered in the Tomasetti and Vogelstein dataset based on knowledge about anatomical sites and the total number of stem cell divisions. Two important points emerge. First, based only on this data set, it is not possible to assess to what extent incidence and variation in incidence across cancers is due to the environment and/or “natural” probabilities of cellular transformation to malignancy (e.g., random mutations). That the slope in our statistical analysis is approximately 1 suggests that if environments were having a significant impact on cancer incidence (again, for the subset of cancers used in our analysis), then these environmental effects would appear to be affecting one or more anatomical sites in some proportional kind of way (so that slopes remain little changed). It is impossible to assess this hypothesis without additional data and therefore, it is premature to conclude that even for the cancer types without any apparent major environmental causes, more subtle environmental effects (i.e., which would reduce the “bad luck” component of the inference) do not play a role. This said, even if we are able to identify how such environmental factors may increase cancer incidences across the board, to the extent that exposure is unavoidable and secondary prevention non-existent, the effect of natural and added environmental effects are both effectively “bad luck”. Thus a slope of 1 indicates that variation within each anatomical site is due to “bad luck”, but we cannot assess the relative impacts of bad luck and the environment between anatomical sites.

Second, our analysis says that simple rules (effects of total number of stem cell divisions and anatomical site) predict incidence with high confidence. This is a surprising result and suggests that having fewer or greater numbers of stem cell divisions for a given anatomical site would put a given individual at lower or greater risk, respectively, for cancer in that site. Our analysis did not statistically investigate how differences in the total number of stem cell divisions between individuals was predictive of cancer risk, but some studies are suggestive of this type of effect (e.g., Roychoudhuri et al. 2006; Kabat et al. 2013). Thus, to the extent that a given individual is potentially more prone to certain cancers based on more expected lifetime stem cell divisions, this can be regarded as an uncontrollable causal factor should the cancer be obtained (i.e., “bad luck”).

Therefore, we cautiously conclude that if we call “bad luck” our lack of ability (based on for example, life-style changes or chemoprevention) to reduce cancer risk, and Tomasetti and Vogelstein’s data are indeed accurate, then random effects do explain some and possibly most of the variation, when accounting for anatomical site (Noble et al. 2015). Nevertheless, we do know that primary and secondary prevention can reduce probabilities of morbidity and mortality for a range of cancers (e.g., Martin-Moreno et al. 2008). Thus, even if some degree of “bad luck” contributes to explaining variation in cancer incidence, this does not lessen the fact that life-style and active forms of prevention will influence a person’s life-time risk of cancer.

Kabat GC, Anderson ML, Heo M, Hosgood HD, Kamensky V, Bea JW, Hou L, Lane DS, Wactawski-Wende J, Manson JE, Rohan TE. (2013). Adult stature and risk of cancer at different anatomic sites in a cohort of postmenopausal women. Cancer Epidemiol Biomarkers Prev. 22(8):1353-63.

Martin-Moreno JM, Soerjomataram I, Magnusson G. (2008) Cancer causes and prevention: A condensed appraisal in Europe in 2008, European Journal of Cancer 44(10):1390-1403.

Noble R, Kaltz O, Hochberg ME. 2015. Statistical interpretations and new findings on Variation in Cancer Risk Among Tissues. arXiv q-bio arXiv:1502.01061

Roychoudhuri, R, Putcha, V, and Moller, H. (2006). Cancer and laterality: a study of the five major paired organs (UK). Cancer Causes & Control 17(5):655-662

• David Colquhoun says:

Thanks very much for that clear exposition. I hope that some version of it will appear in your final paper.

Martin-Moreno et al. (2008) is, of course, all based on associations. If you don’t smoke cigarettes, and drink alcohol in moderation, the other lifestyle effects are small, and causality is dubious. I’d presume, therefore, that for this group at least, most cancers are a matter of bad luck.

• Agreed, indeed to the extent that we could do better controls (RE the Martin-Moreno et al study), we may be able to detect a signal, but still would not necessarily be able to determine causation, nor whether causation is apparent, but other unquantified intervening factors (past events, genetics, etc) were not also ‘causal’. Doll and Peto’s classic work (BMJ 1976; pg 1531, tables IX and X) is more suggestive about the effects of ceasing to smoke. Cuzick et al’s work (Fig 1, Annals of Oncology 26: 47–57, 2015) on aspirin too would appear to be as far as we can go with human subjects in estimating causal effects.

This all said, the results published in our arXiv article are indicative that “bad luck” could play a considerable role in the subset of cancer data we analyzed. Again, we use “bad luck” in a contrasting way to the use in Tomasetti and Vogelstein. In our study, “bad luck” could have environmental components, increasing incidence beyond expectations based on background levels of variation in stem cell number, division rate, and mutational probabilities. This will be clarified in subsequent publications of our work.

5. Ian Johnson says:

Excellent post. I have a couple of questions. The original analysis was based, I think, on US incidence rates only. However some cancers, notably colon and esophagus, show marked variations across populations (at least 10 fold for colon and even higher for squamous cell esophagus within e.g. China). Can these issues be addressed or acknowledged within your analysis? Secondly, from memory, I think colon and small intestine have somewhat similar stem cell divisions but markedly different incidence rates. Any comments?

• A similar question was asked on my G+ share of this post, and since the discussion is slightly splintered I will try to recreate some of the answer I wrote there. Rob or Mike will probably clarify this further, since this is based on my impression/memory of their work.

When he was presenting their reanalysis, Rob made a good point about the single-patient-type data in relation to skin cancers. The idea is that having single patient types, even weird ones, is actually useful for their analysis because they are breaking things down by tissue type, and different tissues can have different base rates in different environments. So you see the fit with skin cancer in the west, but maybe the base rate is elevated because people generally like to sunbathe. If you looked at some other demographic where sunbathing or exposure was less common then you might see a lower base rate but still the same slope of 1 on the log-log scale. Similar with colon and esophagus cancers, in China you might have a base rate that is higher (so the green and black lines, say, are further to the right on the main plot), but you will still expect the same slope (i.e. still slope 1 on the green and black lines).

I am not 100% sure if this is how Rob is thinking about this, so hopefully he will correct me if I am wrong.

The exciting point for me here, is also that as we move to the stomach, say, we also can get interesting effects from a perspective that is not as reductionist as ‘mutations within individual cells’ by ecological interactions with things like H. pylori. Of course, even in that case, though, it might be still linked to higher rates of stem cell divisions due to inflammation.

• Ian Johnson says:

Thanks. It would be interesting to see the effect of introducing epidemiological data for some cancers which do show significant geographical variations into this analysis. Also, on closer inspection I see the two points for large and small intestine, but how close is the slope to 1? I think that, unlike colon, the incidence rates for small bowel are consistently low across populations.

• Rob Noble says:

Artem’s reply saves me answering your first question, Ian, as he expresses my argument very well. We agree that it would be interesting to repeat the analysis with data from different populations; this is one of a few follow-up studies we have in mind.

The difference between colon and small intestine is especially interesting, I think. It may be that grouping these two organs into a single subset is invalid, because they have different stem cell populations and different microenvironments. Richard Peto predicted that the pattern he observed between species might also apply to human organs, and he specifically mentioned the small intestine as a site expected to have very powerful anti-cancer mechanisms (see page 13 of this 1979 article: http://libgallery.cshl.edu/items/show/74089). In the published version of our article, I hope we may be able to say a bit more about differences between the colon and the small intestine (and, perhaps, even between the duodenum, jejunum, and ileum).

6. Ian Johnson says:

Yes that would be valuable. One of the problems with the T&V paper was the failure to make such distinctions. For example they did not distinguish between squamous cell carcinoma of the esophagus and adenocarcinoma, although the epidemiology differs considerably. Their analysis also sorted esophageal cancer into the group mostly attributable to “bad luck”, and yet the adverse effects of alcohol and tobacco use on the risk of squamous type esophageal cancer seem indisputable to me.

7. Rob Noble says:

An updated version of our analysis has been published as “Peto’s paradox and human cancers” in Philosophical Transactions of the Royal Society B: http://rstb.royalsocietypublishing.org/content/370/1673/20150104.

8. alistair bain says:

Do you think small intestine cancer is rare because of antioxidant enzymes in intestinal cells? There are many more oxidant molecules in the small intestine because of the large number of mitochondria providing energy to process food. Or is DNA polymerase more reliable in intestinal cells? Is DNA polymerase in the intestine flexible, allowing the genes that code for it to mutate but still produce a normal polymerase?

9. alistair bain says:

It would be interesting to plot number of mitochondria per cell type versus the cancer incidence. Given the ability of mitochondria to damage DNA with oxygen species, would this be a straight line?
