by tripletao 1380 days ago
Have you looked at Pekar's full model, as described mostly in the supplementary materials? A typical molecular clock approach wouldn't give anywhere near the accuracy necessary to exclude evolution of lineage B (just two SNPs away) in humans. Pekar instead builds layer upon layer of complexity, with dozens of reasonable but somewhat arbitrary judgment calls, in the same general direction as econometrics. From the shape of the resulting modeled phylogenetic tree, he purports to exclude a single introduction into humans.

I'm not aware of any case where any similar model has been shown to have predictive power, and there's inherently no way to validate this one against any physical data. So I believe this result has been grossly oversold, per my comments and links at

https://news.ycombinator.com/item?id=32740568


> A typical molecular clock approach wouldn't give anywhere near the accuracy necessary to exclude evolution of lineage B (just two SNPs away) in humans

You're ignoring other data which is counter to the idea of B evolving from A in humans. Pekar's models are not the only evidence.

- Early cases were predominantly B
- A shows less genetic divergence than B; this is what Pekar is talking about with regard to the discontinuity in the early clock.

When we first started discussing this - I spoke up because I was annoyed by you trashing peer-reviewed papers when it was obvious you weren't even attempting to grok the phylogenetics involved. Still annoyed.

It's been genuinely interesting watching the scientific debate over how to root the SC2 tree over the past few years, because of the paradoxes involved.

"Just a few SNPs" is just such a silly argument when stacked against peer-reviewed phylogenies in high-impact publications.

Have you looked at Pekar's full numerical stack yourself, as described in their supplemental materials? If yes, then why are you confident that their choice of the Barabasi-Albert algorithm to generate a fixed infection network correctly models the earliest spread of SARS-CoV-2 in humans? In particular, why choose to study robustness against doubling time (which seems intuitively like it wouldn't affect the shape of the tree much), but not robustness against that connectivity (which seems intuitively like it would)?

The rest of their arguments depend fundamentally on the polytomy thing, because nothing else excludes an earlier (even September) first introduction into humans. With an earlier introduction and thus more extensive unsampled spread, it's much harder to insist that A and B would be first sampled in the same order in which they evolved in humans, or make any similar early claims with confidence.

You are correct that I hadn't fully understood their polytomy argument before you brought it up, and I appreciate you bringing it to my attention. I still don't think it's very good, though. I later found Erik van Nimwegen's criticisms, which roughly followed my own; so I don't think I'm taking a fringe position here. Indeed, I've never seen anyone citing or defending Pekar engage in any way with the numerical complexity of that model. It seems like anyone who's looked inside the box becomes a critic, thus my hope that you'll do so.

High-impact publications have shown an unfortunate willingness to publish low-quality work that would exclude a research-related origin of SARS-CoV-2. For example, I assume you followed Nature's publication, editor's note, and ultimate extensive correction of their pangolin paper, and that you agree pangolins aren't the proximal host. This makes me less inclined to trust their reviewers here, and more inclined to trust my own judgment (or that of the two Twitter threads I've linked elsewhere).

> In particular, why choose to study robustness against doubling time (which seems intuitively like it wouldn't affect the shape of the tree much)

As I understand it, the doubling times observed in the simulations were primarily the result of the ascertainment and transmission rate parameters.

Care to elaborate why you think the robustness of the model with respect to transmission rate should be assumed? I don't share your intuition here, and note that the authors observe, "that sensitivity analyses with longer doubling times increase the support for multiple introductions."

You really fault them for robustness analysis here?

To be clear I don't fault them for studying robustness against doubling time; I fault them for not studying robustness against connectivity of the infection network, since that seems like it would be more important than any of the parameters that they did study. My intuition is that when spread is highly deterministic (e.g. if R0 = 2 and each patient infects exactly two others), it's easy to make inferences about past spread from the present. For example, in that case it really would be near-impossible for a later lineage to outcompete an earlier one.

But we know the spread of SARS-CoV-2 is actually stochastic, with most lineages dying out but a few exploding due to super-spreader events. In that case it's much harder to judge whether a clade is big because it had more generations to grow, or just big because of a few (un)lucky founder effects. In Pekar's epi simulation, that stochasticity is modeled by their connectivity network. I expect that a more overdispersed network (i.e. greater variance in the number of edges at each vertex, keeping the same average) would make non-modal outcomes--like the real pandemic's phylogeny, if it arose from a single introduction--more likely.
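
To make the overdispersion point concrete, here is a minimal sketch of a plain branching process. It is not Pekar's simulation, which uses a contact network. It compares two offspring distributions with the same mean: Poisson, and a negative binomial drawn as a gamma-Poisson mixture. The values R0 = 2, dispersion k = 0.1, the 200-case cap, and all function names are illustrative assumptions, not parameters taken from the paper.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def offspring(r0, k, rng):
    """Secondary cases caused by one case.

    k=None gives Poisson(r0). Otherwise the count is negative binomial with
    mean r0 and dispersion k, drawn as a gamma-Poisson mixture.
    """
    lam = r0 if k is None else rng.gammavariate(k, r0 / k)
    return poisson(lam, rng)


def goes_extinct(r0, k, rng, cap=200):
    """Run one chain from a single case until it dies out or reaches `cap` active cases."""
    cases = 1
    while 0 < cases < cap:
        cases = sum(offspring(r0, k, rng) for _ in range(cases))
    return cases == 0


def extinction_fraction(r0, k, trials, rng):
    """Fraction of independent single-case introductions that die out."""
    return sum(goes_extinct(r0, k, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    rng = random.Random(0)
    # ASSUMPTION: R0 = 2 and k = 0.1 are illustrative values only.
    print("Poisson offspring:      ", extinction_fraction(2.0, None, 2000, rng))
    print("Overdispersed (k = 0.1):", extinction_fraction(2.0, 0.1, 2000, rng))
```

With the same average, the fixed points of the generating-function equation put the extinction probability at about 0.20 for Poisson offspring and about 0.89 for k = 0.1, so under these assumptions most overdispersed chains die out and the survivors owe much of their size to a few large events. Whether a contact-network model behaves the same way is exactly the open question above; this sketch doesn't settle it.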

Their results of the simulations are stochastic. They discuss this in-depth, as it complicates their analysis.

I don't understand what you're trying to say. Everyone agrees that the spread is stochastic. Why are you starting with a hypothetical misinterpretation of an R value to make a deterministic strawman? You think that their simulations were too deterministic because of their connectivity network?

> -like the real pandemic's phylogeny, if it arose from a single introduction-

Propose a phylogeny already. Root this thing.

> You think that their simulations were too deterministic because of their connectivity network?

Yeah, pretty much; and it's what other critics, including well-credentialed mathematical biologists, are saying too. There's a continuum of dispersion, with my perfectly-deterministic strawman at the left extreme but extending to infinity. Their power-law network adds some dispersion, but how do we know it's enough? I believe they chose that distribution because it's been shown to fit some real data (including the spread of HIV) reasonably well; but how do we know it fits the early spread of SARS-CoV-2, in the earliest lineages of the virus with unknown biology, in an unknown group of people with unknown behaviors?
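
For a feel of how much dispersion a power-law network adds, here is a rough sketch of preferential attachment in the Barabasi-Albert style, in pure Python. The network size n = 2000 and m = 2 edges per new node are made-up values, not the ones used in the paper, and `barabasi_albert_degrees` is a hypothetical helper written for this comment, not a function from their code.

```python
import random
from statistics import mean, pvariance


def barabasi_albert_degrees(n, m, rng):
    """Return the degree of every node in a preferential-attachment graph.

    Starts from a star on m + 1 nodes, then attaches each new node to m
    distinct existing nodes chosen with probability proportional to degree.
    """
    degrees = [0] * n
    endpoints = []  # each node appears once per incident edge
    for leaf in range(1, m + 1):
        degrees[0] += 1
        degrees[leaf] += 1
        endpoints += [0, leaf]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            # Picking a random edge endpoint is degree-proportional sampling.
            targets.add(rng.choice(endpoints))
        for target in targets:
            degrees[new] += 1
            degrees[target] += 1
            endpoints += [new, target]
    return degrees


if __name__ == "__main__":
    # ASSUMPTION: n and m are illustrative, not the paper's parameters.
    degrees = barabasi_albert_degrees(2000, 2, random.Random(0))
    print("mean degree:    ", mean(degrees))
    print("degree variance:", pvariance(degrees))
    print("max degree:     ", max(degrees))
```

The mean degree is pinned near 2m by construction, so the interesting outputs are the variance and the maximum. Those are the quantities a sensitivity analysis on connectivity would vary while holding the mean fixed.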

I don't know how to root the phylogeny, and I'm mistrustful of anyone who claims they can based on the limited information available. Anyone who's built and attempted to validate mathematical models knows that sometimes, there's simply not enough information to confidently reach any useful conclusions. Absent validation of the approaches used here (e.g. evidence that they've successfully made predictions in the past in similar situations), I believe that's our situation here.

Yes, I've reviewed the supplemental materials.

> because nothing else excludes an earlier (even September) first introduction into humans. With an earlier introduction and thus more extensive unsampled spread, it's much harder to insist that A and B would be first sampled in the same order in which they evolved in humans

The tMRCA clearly excludes an earlier introduction. Because the tMRCA is based on genetic diversity, you cannot calculate a tMRCA based on all the known samples, get a date, and then say "oh, geez- well, there was also wide cryptic spread before that." It just doesn't make sense. Pekar addresses this point directly.

A race between the first A and the first B is a strawman. Rather, it's the predominance of lineage B over A in the early pandemic which is interesting. It would be unexpected for lineage B to dominate if A came first. Much of the modeling is to get a handle on how unlikely that situation would be. It shouldn't be surprising that the models don't support it as being likely. (But, that's not the only evidence.)

If you're willing to actually think about and engage on the phylogeny - stop with the "just a few SNPs" nonsense, and ask yourself what you really think the early origins looked like. If it really was a single introduction - Was lineage A ancestral? Was B ancestral? A C/C ancestor? A T/T ancestor? All these have interesting problems being supported by the data.

Finally, after reading some of your earlier comments, I'm realizing that you're conflating several techniques from Pekar's paper, eg:

> Have you looked at Pekar's full model, as set out mostly in the supplementary materials? This isn't any standard molecular clock approach. It's a byzantine stack of plausible but somewhat arbitrary assumptions, ending in a simulated phylogenetic tree.

His epi simulations are separate from the tree-building, with the possible exception of rooting, which he used the output of the models to inform. Otherwise, the epi modeling that everyone is hand-wringing over is really separate and doesn't end "in a simulated phylogenetic tree."

There /are/ novel methods used in the tree building (eg, non-reversibility of base substitutions), but that's a whole separate technique.

> Essentially Pekar's argument is a "two introductions of the gaps"--that if their model of a single introduction doesn't conform to reality, then it must have been two introductions.

BS. Again - understanding the paradoxes and debate involved in rooting the tree is basically required to understand the importance of this paper. The existing data is confounding and didn't conform to a logical understanding of viral evolution. A separate introduction elegantly explains the existing evidence.

If their modeling isn't strong enough evidence for you, fine. But that's different than throwing everything out because you don't understand how "just a couple SNPs" can still provide sufficient resolution to make phylogenetic inferences possible. If you think that "just a couple SNPs" /don't/ provide enough for experts in the field to inform their phylogenies, at least get to that argument directly instead of throwing ignorant shade at an unrelated portion of the paper.

Thanks for the links to those other threads. Nod's was interesting, but AFAICT, way off-base, starting around "Needless to say, early winter in Wuhan is not the Mardi Gras."

Here's Pekar's earlier thread which I recently reread and found helpful for understanding the significance of the phylogeny (#20 is where he gets into how lineage A breaks the clock):

https://twitter.com/jepekar/status/1499840335349911553

and Worobey re-emphasizing that we're not just talking about a few SNPs, it's the shape of the tree which matters:

https://twitter.com/michaelworobey/status/157050467474223923...

I think you're talking about their model in "Inferring the MRCA of SARS-CoV-2", and I'm talking about their model in "Separate introductions of lineages A and B"? So you're saying they don't use the epi simulations to root and build the phylogenetic tree of real sampled genomes, which is true. I'm saying they do use the epi simulations to build a phylogenetic tree for each simulated pandemic, whose shape (polytomy structure) they then compare against the real tree:

> We simulated SARS-CoV-2–like epidemics (22, 23) with a doubling time of 3.47 days [95% highest density interval (HDI) across simulations, 1.35 to 5.44] (24–26) to account for the rapid spread of SARS-CoV-2 before it was identified as the etiological agent of COVID-19 (figs. S21 and S22, tables S3 and S4, and supplementary text). We then simulated coalescent processes and viral genome evolution across these epidemics to determine how frequently we recapitulated the observed SARS-CoV-2 phylogeny.

Coverage of this paper in the popular press usually said something like "study finds that SARS-CoV-2 arose from two introductions into humans", so I thought the latter was the more important result and started there. Like in your second link, Worobey says:

> [...] We then go on the explain, point by point, that it is not a two-mutation difference that is unexpected. It is a two mutation difference between two large clades like lineage A and lineage B, each displaying a MASSIVE polytomy at their root. This is something that [sic] DO NOT see in ~99.5% of simulations. That is the crux of the paper. Not the idea that two mutations can't happen in a single transmission event.

Are those "simulations" not the SIR-type epi simulations (followed by simulation of the mutations and sampling, then construction of the tree)? I believe his 99.5% is 100% minus the 0.5% from Figure 2C.

Their former model is of course independent of their SIR stuff, and indeed purports to independently establish tMRCA in humans too recent for significant cryptic spread. It carries a different set of plausible but arbitrary assumptions though, again about the stochasticity/overdispersion and sampling rate of early spread, just less directly.

Glad we're on the same page about the multiple techniques now. Statements you made like, "Pekar et al. do some complicated phylogenetic modeling that purports to show the MRCA in humans is too recent" and "This isn't any standard molecular clock approach. It's a byzantine stack of plausible but somewhat arbitrary assumptions" made it clear there was confusion before. Their tree is based on a couple of novel modifications to established techniques. Your characterizations were inaccurate and laughable.

> It carries a different set of plausible but arbitrary assumptions though, again about the stochasticity/overdispersion and sampling rate of early spread, just less directly.

So, you don't only have problems with the modeling of the authors, but their base phylogeny too? Do you reject their tMRCA? Good grief.

I'm still looking forward to discussing the molecular phylogenetics of this paper sometime.

On reflection, I believe the first of my statements that you've quoted was indeed incorrect, and that I was also incorrect when I just wrote:

> Their former model [...] purports to independently establish tMRCA in humans too recent for significant cryptic spread.

Even if SARS-CoV-2 really entered humans in December, with minimal cryptic spread, that's still enough time for the two lineages to evolve in humans, since they're (sorry) just two SNPs apart. I believe Worobey knows this, and that's the reason why he emphasizes the "Separate introductions" model, since their polytomy thing--and not any question of time for cryptic spread--is their best and only argument to exclude that. So I was wrong to mention the tMRCA at all, since even perfect knowledge of that wouldn't tell us confidently how the two lineages arose.
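
As a back-of-the-envelope check on the "enough time" claim, here is a sketch that treats substitutions along a single lineage as a Poisson process. The rate of about two substitutions per genome per month is an assumed ballpark that isn't stated anywhere in this thread, and `prob_at_least` is a hypothetical helper. It says nothing about the polytomy argument, which concerns the shape of the tree rather than elapsed time.

```python
import math


def prob_at_least(n_mutations, rate_per_day, days):
    """P(N >= n_mutations) for N ~ Poisson(rate_per_day * days)."""
    expected = rate_per_day * days
    below = sum(
        math.exp(-expected) * expected**i / math.factorial(i)
        for i in range(n_mutations)
    )
    return 1.0 - below


if __name__ == "__main__":
    # ASSUMPTION: roughly 2 substitutions per genome per month.
    rate = 2 / 30
    for days in (14, 30, 60):
        print(days, "days:", round(prob_at_least(2, rate, days), 3))
```

Under that assumed rate, a lineage has about a 59% chance of picking up two or more substitutions within 30 days, which is the sense in which a December introduction leaves enough time.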

The second of my statements seems correct to me. Not only is their argument for two introductions not a standard molecular clock approach, but it's not a molecular clock approach at all, since "Inferring" provides no support. Their only support comes from the polytomy thing in "Separate". This makes the accuracy of their epidemiological simulation highly relevant, thus the "hand-wringing" over that.

I'd note that you yourself referred me to "Separate", back in:

https://news.ycombinator.com/item?id=32258096

So why did you switch to "Inferring"? I guess we could discuss that too, but per above I don't believe that could provide significant support for two introductions into humans, and thus not for natural vs. research-related origin. Do you believe otherwise? Or do you just mean the approach is of general interest, independently of that question of origin?