by kakarot 3514 days ago
OP's comment was succinct and digestible by a non-technical audience with basic biological knowledge. If he had known of the model, surely he could have spared 2-3 sentences like OP.

OP's comment misinterprets a quote; it's great to bring up that model, but it's ludicrous to think that Mike Stratton does not know it in far greater detail than nonbel, as I would place Stratton as the world's authority on cancer mutations. He was the senior author on the authoritative survey of cancer mutational processes:

http://www.nature.com/nature/journal/v500/n7463/full/nature1...

In addition, nonbel's snarkiness and assumptions of ignorance end up misinforming HN readers more.

Scientists do not write these articles for the BBC. The reporters pick and choose what to take from the scientist, and report that. Most of the time they probably get most of the quote correct, or the scientist said something that had many of the words in the quote. But generally these types of articles are barely intelligible to the scientists who were interviewed for them.

fair enough, you're probably right
Let's gather info from the paper and see if what they say makes sense. In discussing figure 1, they seem to know this data needs to be normalized to the number of cell divisions:

>"The prevalence of somatic mutations was highly variable between and within cancer classes, ranging from about 0.001 per megabase (Mb) to more than 400 per Mb (Fig. 1). Certain childhood cancers carried fewest mutations whereas cancers related to chronic mutagenic exposures such as lung (tobacco smoking) and malignant melanoma (exposure to ultraviolet light) exhibited the highest prevalence. This variation in mutation prevalence is attributable to differences between cancers in the duration of the cellular lineage between the fertilized egg and the sequenced cancer cell and/or to differences in somatic mutation rates during the whole or parts of that cellular lineage1."

And that they believe these mutations are accumulating at a relatively constant rate over time:

>"The mutations in a cancer genome may be acquired at any stage in the cellular lineage from the fertilized egg to the sequenced cancer cell. The correlation with age of diagnosis is consistent with the hypothesis that a substantial proportion of signature 1A/B mutations in cancer genomes have been acquired over the lifetime of the cancer patient, at a relatively constant rate that is similar in different people, probably in normal somatic tissue"

So now let's implement their model with the required assumptions:

  Define the probability a mutation occurs during a given cell division as p.
  Define the probability a mutation does not occur during a given cell division as q = 1-p.
  Define the number of accumulated mutations required for carcinogenesis as n.
  Define the number of cell divisions that have passed since the zygote as d.
  Define the number of cell lineages in the tissue as Ncell.
  Define the proportion of cancer cells that go on to form detectable tumors as C.

  Assume the mutations can only occur once per cell.
  Assume the mutations are occurring at the same rate (ie p1 = p2 = ... =  pn).
The probability a mutation does not occur during division 1, and not during division 2, ... and not during division d would then be given by q^d (since p is constant we simply multiply the probabilities as for independent events).

The probability the mutation did occur at some point up to time d must then be given by 1-q^d. And for the n required mutations we would get

  (1-q^d)^n.
We just derived the CDF of the geometric distribution, extended to allow for multiple parallel events. This is the cumulative probability of a cell lineage turning cancerous according to the mental model they describe in the paper, which is pretty much Armitage-Doll without mentioning the name.
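
A quick sanity check of that expression in R, as a minimal sketch. R's pgeom counts failures before the first success, so pgeom(d - 1, p) should equal 1 - q^d for a single mutation. The helper name lineage_cdf and the values of p, n, and d below are illustrative assumptions, not taken from the paper.

```r
# Cumulative probability that all n required mutations have occurred
# within d divisions, each with per-division probability p.
# lineage_cdf is a hypothetical helper name used only for illustration.
lineage_cdf <- function(d, p, n) {
  q <- 1 - p
  (1 - q^d)^n
}

p <- 1e-4                      # illustrative per-division mutation probability
n <- 2                         # illustrative number of required mutations
d <- c(1, 100, 5000, 20000)    # illustrative division counts

# pgeom(k, prob) is P(at most k failures before the first success),
# so pgeom(d - 1, prob = p) is the chance one mutation occurred by division d.
closed_form <- lineage_cdf(d, p, n)
via_pgeom   <- pgeom(d - 1, prob = p)^n

print(all.equal(closed_form, via_pgeom))
```

If the two vectors agree, the closed form really is the geometric CDF raised to the power n, which is the "multiple parallel events" extension described above.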

To get the probability of a cell lineage turning cancerous at a given age (ie the pdf of this distribution) we calculate the first derivative of that function (warning: this is a continuous approximation of a discrete process):

  -n*q^d*log(q)*(1 - q^d)^(n-1)
The expected number of cases per person after d divisions (division-specific incidence rate) would then be

  C*Ncell*-n*q^d*log(q)*(1 - q^d)^(n-1)
You can see that only the height of the curve is affected by C and Ncell; the shape is independent of those factors. In the (non-simplified) Armitage-Doll model the shape of the curve depends only on the mutation rate and number of required mutations.
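
The derivative and the scaling claim can both be checked numerically. This is a rough sketch: the step size h and the values of C and Ncell are arbitrary assumptions chosen only for illustration.

```r
p <- 1e-4; q <- 1 - p; n <- 2
d <- 1:20000

# CDF from the derivation above and its analytic derivative
cdf          <- function(d) (1 - q^d)^n
pdf_analytic <- function(d) -n * q^d * log(q) * (1 - q^d)^(n - 1)

# Central finite difference of the CDF (continuous approximation in d)
h <- 0.01
pdf_numeric <- (cdf(d + h) - cdf(d - h)) / (2 * h)
max_rel_err <- max(abs(pdf_numeric - pdf_analytic(d)) / pdf_analytic(d))

# Hypothetical values for C and Ncell: they rescale the curve only
C <- 0.01; Ncell <- 1e6
incidence <- C * Ncell * pdf_analytic(d)
same_peak <- which.max(incidence) == which.max(pdf_analytic(d))

print(max_rel_err)
print(same_peak)
```

A small max_rel_err indicates the analytic expression matches the slope of the CDF, and same_peak shows that multiplying by C*Ncell does not move the location of the maximum.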

In that paper, they report seeing a range of roughly 10^-9 to 10^-4 cancer-specific mutations per bp in already detected tumors. If those arose after 10 divisions, the mutation rate would be 10^-10 to 10^-5 mutations/bp/division, etc. So we can see those values are empirically determined upper bounds on the mutation rates. So let's use the higher of the two as our value of p. Let us also assume only n = 2 mutations need to accumulate to result in a detectable tumor. Using R to make the upper plot:

  p = 10^-4; q = 1-p; n = 2; d = 1:20000
  plot(d, -n*q^d*log(q)*(1 - q^d)^(n-1),  type = "l",  
       xlab = "Divisions since Zygote",  ylab = "Pr(a Cell Lineage Will Turn Cancerous)")
  abline(v = log(1/n, base = q))
https://s14.postimg.org/p6wncjv9d/melan.jpg

Actually, by setting the second derivative of that CDF to zero, we can see that the Armitage-Doll model predicts a peak in age-specific incidence at log(1/n, base = q) divisions (vertical line on the upper plot). That 10^-4 value comes from Melanoma, so let us also look at the age-specific incidence for that cancer (lower plot). There we see the peak incidence occurs at ~age 90. So according to their model, the skin cells that are causing melanoma must be ~7k divisions separated from the zygote, corresponding to an average of ~78 divisions each year, or every ~5 days. Is that what happens?
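
The arithmetic in this paragraph can be reproduced in a few lines. A sketch only: the peak age of 90 is read off the melanoma plot and entered by hand as an assumption, and a 365-day year is assumed.

```r
p <- 1e-4; q <- 1 - p; n <- 2

# With x = q^d the pdf is proportional to x * (1 - x)^(n - 1),
# which is maximised at x = 1/n, i.e. d = log(1/n, base = q).
d_peak <- log(1 / n, base = q)

# Numerical check on the same integer grid used for the plot
d <- 1:20000
pdf_vals <- -n * q^d * log(q) * (1 - q^d)^(n - 1)
d_peak_grid <- d[which.max(pdf_vals)]

# Assumption: peak incidence at age 90 (taken from the lower plot)
peak_age_years     <- 90
divisions_per_year <- d_peak / peak_age_years
days_per_division  <- 365 / divisions_per_year

print(c(d_peak = d_peak, d_peak_grid = d_peak_grid,
        per_year = divisions_per_year, days = days_per_division))
```

The printed values should land close to the ~7k divisions, ~78 divisions per year, and ~5 days per division quoted above.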

Remember, we used a real upper, upper bound here on the mutation rate from their data, and only 2 required accumulated mutations. Even then we are getting into cells that are 78 generations separated from the zygote before being cancerous. What you will find is that the division rates required to fit what people really suggest (eg p=10^-7 and n=3) are insane according to the accepted model. If they have a different model than that, why do they not write it down and compare to epidemiological data?
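
To put numbers on that comparison, here is a hedged sketch that reuses the same peak formula for both parameter sets. The helper required_division_rate is hypothetical, and applying the melanoma peak age of 90 to the p=10^-7, n=3 case is an assumption made only so the two cases can be compared.

```r
# required_division_rate is a hypothetical helper for illustration.
# It returns the divisions needed for the modelled incidence peak to
# fall at peak_age_years (assumed 90, as in the melanoma example).
required_division_rate <- function(p, n, peak_age_years = 90) {
  q <- 1 - p
  d_peak <- log(1 / n, base = q)
  c(divisions_at_peak  = d_peak,
    divisions_per_year = d_peak / peak_age_years,
    divisions_per_day  = d_peak / (peak_age_years * 365))
}

upper_bound_case <- required_division_rate(p = 1e-4, n = 2)
suggested_case   <- required_division_rate(p = 1e-7, n = 3)

print(rbind(upper_bound_case, suggested_case))
```

Under these assumptions the p=10^-7, n=3 case needs on the order of 10^7 divisions by the peak age, which works out to hundreds of divisions per day.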

This isn't like a long crackpot screed. It is a couple paragraphs... Why downvotes without explanation? Where did I go wrong (I see some typos at the bottom "78 generations separated" should be "7000", but that shouldn't be a huge deal)?
I didn't downvote, but it definitely does come across as a crackpot screed. You are discounting data in favor of an overly simplistic statistical model. You also greatly misinterpret a key point in the quote: they're not saying all mutations accumulate at a constant rate, but only a few signatures appear to. In fact, the very next sentence of that quote is:

>The absence of consistent correlation of all other signatures with age suggests that mutations associated with these have been generated at different rates in different people, possibly as a consequence of differing carcinogen exposures or after neoplastic change has been initiated.

This is a classic crackpot technique: selectively quote just the parts that you want them to say, twist it a bit further to your needs, then proceed with an overly simplistic, but supposedly impressive analysis. I don't know or really think that you are a crackpot, but the quoting behavior is quite telling.

Getting back to your original comment, you accuse the authors of this paper:

http://science.sciencemag.org/content/354/6312/618.full

of not knowing what they're talking about. But in reality, you have already mistaken the type of process that's being talked about. Stratton is talking about a biological and chemical process. You're talking about a "random" process from statistics. An old theory, that uses simplifying assumptions that do not apply with this data.

And finally, the most obvious reason that the Armitage Doll process is not the best explanation is that AD were looking at the process of carcinogenesis. This paper is looking at the various processes of mutations that happen because of a carcinogen. These are different things, especially since mutational processes accelerate after carcinogenesis. I believe the paragraphs that you would find most interesting from the paper are here:

>Signature 5 is found in all cancer types, including those unrelated to tobacco smoking, and in most cancer samples. It is “clocklike” in that the number of mutations attributable to this signature correlates with age at the time of diagnosis in many cancer types (17). Signature 5, together with signature 1, is thought to contribute to mutation accumulation in most normal somatic cells and in the germline (17, 23). The mechanisms underlying signature 5 are not well understood, although an enrichment of signature 5 mutations was found in bladder cancers harboring inactivating mutations in ERCC2, which encodes a component of NER (24).

>Signature 5 (or a similar signature that is difficult to differentiate from signature 5 because of the relatively flat profiles of these signatures) was increased by a factor of 1.3 to 5.1 (q < 0.05; table S2) in smokers versus nonsmokers in all cancer types together and in lung squamous, lung adenocarcinoma, larynx, pharynx, oral cavity, esophageal squamous, bladder, liver, and kidney cancers. The association of smoking with signature 5 mutations across these nine cancer types therefore includes some for which the risks conferred by smoking are modest and for which normal progenitor cells are not directly exposed to cigarette smoke (Table 1). Given the clocklike nature of signature 5 (17), its presence in the human germline (23), its ubiquity in cancer types unrelated to tobacco smoking (18), and its widespread occurrence in nonsmokers, it seems unlikely that signature 5 mutations associated with tobacco smoking are direct consequences of misreplication of DNA damaged by tobacco carcinogens. It is more plausible that smoking affects the machinery generating signature 5 mutations (24). Presumably as a consequence of the effects of smoking, signature 5 mutations correlated with age at the time of diagnosis in nonsmokers (P = 0.001) but not in smokers (P = 0.59).

Armitage Doll relates at most tangentially to what is being reported by these scientists.

Thanks for responding, I hope to get back to you in more depth. But first of all:

>"An old theory, that uses simplifying assumptions that do not apply with this data."

Yes, get rid of one simplifying assumption that was originally introduced for computational reasons and is totally unnecessary today (low mutation rate), and you can see it is impossible for that theory to fit the age-specific incidence data using accepted mutation rates + division rates.

Something is wrong, yet in the supplement of the Alexandrov et al (2016) paper, which has the same first and last authors as the Alexandrov et al (2013) paper you cited, they use this model without comment on that issue.

Also, in the 2013 paper, Armitage-Doll is not mentioned but it is clear to anyone familiar with that model that it is guiding their interpretation of the results.