Hacker News
by jpeloquin 1334 days ago
I realized why this being based on simulation bothered me: this is a machine learning classifier that classifies viral genomes as synthetic or natural. The training set has n = 72 (all negative, which is justifiable if you're OK with null hypothesis significance testing), the validation set has n = 6 (only synthetic examples, which is less fine), and there's no test set. No effort was made to estimate the true positive rate, false positive rate, etc. If this were published as a machine learning paper instead of a biology paper, it would probably be held to a higher standard.
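To make the point about error rates concrete, here is a minimal sketch of the estimate I'd expect to see, assuming a held-out test set with both synthetic (1) and natural (0) genomes. The function names and the example labels are hypothetical and not from the paper. The Wilson score interval is my choice for the uncertainty estimate, not something the authors used.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def _rate_with_ci(hits, total):
    # wilson_interval raises ValueError when the class is absent (total == 0),
    # which is exactly the case where the rate cannot be estimated at all.
    ci = wilson_interval(hits, total)
    return hits / total, ci


def classifier_rates(y_true, y_pred):
    """Return TPR and FPR with intervals; labels: 1 = synthetic, 0 = natural."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "tpr": _rate_with_ci(tp, tp + fn),
        "fpr": _rate_with_ci(fp, fp + tn),
    }


if __name__ == "__main__":
    # Hypothetical labels, only to show the call shape.
    y_true = [1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 1]
    print(classifier_rates(y_true, y_pred))

    # Even a perfect 6/6 on six synthetic examples leaves the lower
    # bound on the true positive rate at roughly 0.61.
    print(wilson_interval(6, 6))
```

With only six positive examples the interval on the true positive rate stays wide even if all six are detected, and a set with no natural genomes gives no false positive rate at all (the sketch raises an error in that case).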
1 comments

Nothing stops you from gathering a set of wild viral genomes and known engineered ones, unseen in this analysis, and generating your own test set. But be sure to preregister your study and document every search query, so that you can prove to the rest of the world that you hold yourself to the same standards you hold others to.