by jpeloquin 1335 days ago
"To avoid losing power with multiple comparisons, we focus our analysis on the BsaI/BsmBI sites in SARS-CoV-2 and compare the BsaI/BsmBI map in SARS-CoV-2 to all other restriction maps of all other CoVs used in our analysis."

It's important to know whether the authors picked BsaI and BsmBI blind, before looking at the genome. If they picked BsaI and BsmBI with knowledge of the SARS-CoV-2 genome, that doesn't dodge the multiple comparisons problem and the p-values aren't reliable. I guess it depends on how many other commonly used Type IIS endonucleases there are. The authors use 214 to generate their null hypothesis distribution for the CoV restriction maps but say only 6 are specifically amenable to BAC cloning.
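To put rough numbers on the multiple comparisons concern, here is a minimal sketch. It assumes the 6 and 214 figures quoted above are enzyme counts, that every pair of enzymes counts as one candidate test, that the tests are independent, and that the threshold is 0.05. None of these assumptions comes from the paper, and `familywise_error_rate` is a name made up for illustration.

```python
from math import comb


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Assumed candidate pools: 6 enzymes (cloning-friendly) and 214 (full null set).
for n_enzymes in (6, 214):
    n_pairs = comb(n_enzymes, 2)  # every unordered pair is one possible comparison
    fwer = familywise_error_rate(0.05, n_pairs)
    print(f"{n_enzymes} enzymes -> {n_pairs} pairs, FWER = {fwer:.3f}")
```

Even with only 6 candidates there are 15 possible pairs, so under these assumptions the chance of at least one spurious hit at 0.05 is a bit over one half if the pair was chosen after looking.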

The "wild type distribution" null distribution for fragment length (Figure 3C) being a simulation (permutation of known CoV genomes, split at randomly selected restriction sites) bothers me. On the first read I thought it was a distribution of fragment lengths in real viruses. Does synthesizing virtual genomes by permutation produce a realistic distribution of fragment lengths?
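For readers unfamiliar with simulated null distributions, the sketch below shows the general idea. It is not the authors' procedure: it places cut sites uniformly at random on a linear genome rather than permuting real CoV restriction maps, and the genome length (about 30,000 nt), the number of sites, and the observed value are all made-up illustration values.

```python
import random


def longest_fragment_fraction(genome_length: int, n_sites: int, rng: random.Random) -> float:
    """Cut a linear genome at n_sites random positions; return longest fragment / length."""
    cuts = sorted(rng.sample(range(1, genome_length), n_sites))
    edges = [0] + cuts + [genome_length]
    longest = max(b - a for a, b in zip(edges, edges[1:]))
    return longest / genome_length


def simulate_null(genome_length: int, n_sites: int, n_draws: int, seed: int = 0) -> list:
    """Build a null distribution of longest-fragment fractions by repeated random cutting."""
    rng = random.Random(seed)
    return [longest_fragment_fraction(genome_length, n_sites, rng) for _ in range(n_draws)]


def empirical_p_value(null: list, observed: float) -> float:
    """Share of null draws at least as small as observed, with the usual +1 correction."""
    hits = sum(1 for x in null if x <= observed)
    return (hits + 1) / (len(null) + 1)


null = simulate_null(genome_length=30_000, n_sites=5, n_draws=10_000)
print(empirical_p_value(null, observed=0.27))  # 0.27 is a hypothetical observed value
```

Whether a null built this way resembles fragment lengths in real viruses is exactly the open question: the p-value is only as meaningful as the simulation is realistic.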

4 comments

Bioinformatician here. BsaI and BsmBI are Type IIS restriction enzymes, which means they cleave DNA at a defined distance outside of their recognition sequence rather than within it.
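As a rough illustration of what a restriction map for these two enzymes involves, the sketch below scans a DNA string for the BsaI (GGTCTC) and BsmBI (CGTCTC) recognition sequences on both strands. The function names and the toy sequence are invented for this example, and it only locates recognition sites; it does not model the offset cut positions.

```python
RECOGNITION_SITES = {"BsaI": "GGTCTC", "BsmBI": "CGTCTC"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sites(genome: str, site: str) -> list:
    """Return sorted (0-based start, strand) pairs for every hit on either strand."""
    genome = genome.upper()
    hits = []
    for strand, pattern in (("+", site), ("-", reverse_complement(site))):
        start = genome.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = genome.find(pattern, start + 1)  # allow overlapping matches
    return sorted(hits)


# Toy sequence (not from any real genome): one BsaI site on the forward
# strand and one BsmBI site on the reverse strand.
toy = "AAAGGTCTCATTTTGGGAGACGCCC"
for enzyme, site in RECOGNITION_SITES.items():
    print(enzyme, find_sites(toy, site))
```

Because the recognition sequences are not palindromic, a site can sit on either strand, which is why the reverse complement has to be searched as well.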

BsaI has been used in high throughput assembly techniques such as Golden Gate assembly and Golden Braid assembly.

Golden Gate assembly is an extremely robust method for building modular genetic components. For example, one can create plasmids (circular pieces of DNA) with billions of variants of the spike protein, each carrying a different combination of mutations. Those plasmids are then transfected into coronaviruses and incubated in tissue culture. From there, one usually lets natural selection do its thing: the most infective variants replicate in the tissue much faster and take over the population.

Having said all that, Type IIS restriction sites are usually cut out during assembly, so I'm not sure how having them is good evidence for engineering. If anything, the opposite is true: having none at all would be the more suspicious finding.

They weren't picked blind. They were picked because they are commercially available and conveniently split the genome into 6 similarly sized segments with the spike protein entirely in one. It's precisely what a bioengineer would have elected to use, which is why the paper concentrates on them.

> It's important to know whether or not the authors picked BsaI & BsmBI blind, before looking at the genome.

You could never convince me that these restriction enzymes were picked blindly. No bioinformatician I have ever met does science this way. There is a preliminary period of exploratory data analysis before any hypothesis is put forward, and data dredging and leakage are rampant in the literature.

That's not to say that the spacing of BsaI/BsmBI restriction sites isn't noteworthy, just something to keep in mind.

To your point, however, could someone comment on the suitability of BsaI/BsmBI for the in vitro assembly of synthetic coronaviruses? Is it all just about finding sites in the genome at the right locations which can be turned into restriction sites without disrupting any existing functional genetic elements? Or is there more to it than that? If a research team were to come along and decide they wanted to engineer their own coronavirus, how likely would it be that they would choose these restriction enzymes?

> To your point however, could someone comment on the suitability of BsaI/BsmBI for the in vitro assembly of synthetic coronaviruses?

These are very common enzymes. Perhaps the most common today.

The GP comment is sort of misleading...you wouldn't just pick enzymes at random to do this analysis. You'd pick the enzymes in common use. These count.

> Is it all just about finding sites in the genome at the right locations which can be turned into restriction sites without disrupting any existing functional genetic elements? Or is there more to it than that?

You can add or remove sites using different techniques, such as PCR mutagenesis.

> If a research team were to come along and decide they wanted to engineer their own coronavirus, how likely would it be that they would choose these restriction enzymes?

Highly likely.

> The GP comment is sort of misleading...you wouldn't just pick enzymes at random to do this analysis.

I said pick blind, not at random. I recommend reading https://info.umkc.edu/drbanderson/p-hacking-and-the-problem-...

I know what p-hacking is. There's no reason to believe that they've done that here. The choice of enzymes was motivated by the logic they outlined in the paper, the enzymes chosen are some of the most popular today, and the authors are completely forthright that the choice might affect the outcome.

To fairly make a critique like that, you need to have at least some evidence that a selection bias was applied for no other reason than to affect the p-value. Otherwise, literally every study can be accused of "p-hacking". Here, there's a very good, obvious explanation for the choice that they made, and therefore all you can really say is that the results might be different if you looked at a different set of enzymes.

I realized why this being based on simulation bothered me: this is effectively a machine learning classifier that classifies viral genomes as synthetic or natural. The training set has n = 72 (all negative, which is justifiable if you're OK with null hypothesis significance testing), the validation set has n = 6 (only synthetic examples, which is less fine), and there's no test set. No effort was made to estimate the true positive rate, false positive rate, etc. If this were published as a machine learning paper instead of a biology paper, it would probably be held to a higher standard.
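To make the evaluation point concrete, here is a small sketch of computing true and false positive rates from labeled calls. The labels and predictions are invented for illustration and are not results from the paper; `classification_rates` is a hypothetical helper.

```python
def classification_rates(labels, predictions):
    """Return (TPR, FPR); a rate is None when its denominator is zero."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    fp = sum(1 for y, p in pairs if not y and p)
    tn = sum(1 for y, p in pairs if not y and not p)
    tpr = tp / (tp + fn) if (tp + fn) else None  # needs known synthetic genomes
    fpr = fp / (fp + tn) if (fp + tn) else None  # needs known natural genomes
    return tpr, fpr


# Hypothetical validation set with 6 synthetic genomes and no natural ones.
labels = [True] * 6
predictions = [True, True, True, True, True, False]  # made-up classifier calls
tpr, fpr = classification_rates(labels, predictions)
print(tpr, fpr)
```

With only synthetic examples in the validation set, the false positive rate comes back as `None`: there are no held-out natural genomes to compute it from, which is why a mixed test set matters.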

Nothing stops you from gathering a set of wild viral genomes and known engineered ones that were not seen in this analysis and generating your own test set. But be sure to preregister your study and document every search query, so that you can prove to the rest of the world that you hold yourself to the same standards you hold others to.