by dcdanko 3120 days ago
The figures in this paper use pretty deceptive scales. To be clear, DeepVariant is 0.5% better than a tool built in ~2010 (GATK), on DeepVariant's best test.

GATK is still the standard, not because better variant callers don't exist, but because it's more important that everyone uses the same tool for comparisons between studies.

2 comments

That first paragraph is pretty deceptive. They are not comparing against the results from GATK 1.0.
I didn't see them mention which version they were using, presumably GATK3. I'm curious to see what it'd look like against GATK4 which is being released in a month.
But how much does a difference of 0.5% matter on this metric?
Probably not much at all. SNPs and small indels tend to have many neighbors with which they're highly correlated. If a variant caller misses a single SNP, it has likely still called a bunch of others that nearly always co-occur with it. In most cases downstream association studies would be unaffected.
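To make the "correlated neighbors" point concrete, here's a rough sketch that computes the squared correlation (r^2) between a missed SNP and two nearby SNPs. The genotype vectors are made-up toy data (alt-allele counts of 0/1/2 for 8 hypothetical samples), and `r_squared` is just an illustrative helper, not anything from GATK or DeepVariant.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors.

    Each vector holds alt-allele counts (0, 1, or 2), one entry per sample.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A monomorphic site carries no information about its neighbors.
        return 0.0
    return cov * cov / (var_x * var_y)


# Hypothetical toy genotypes for 8 samples at three nearby sites.
missed_snp = [0, 1, 2, 1, 0, 0, 2, 1]  # the SNP the caller failed to report
neighbor_a = [0, 1, 2, 1, 0, 0, 2, 1]  # always co-occurs with missed_snp
neighbor_b = [0, 1, 2, 1, 0, 1, 2, 1]  # differs in a single sample

print(r_squared(missed_snp, neighbor_a))
print(r_squared(missed_snp, neighbor_b))
```

If a neighbor's r^2 with the missed SNP is close to 1, an association test on the neighbor picks up essentially the same signal, which is why a single missed call usually doesn't change the downstream result.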

It's actually possible that DeepVariant is implicitly learning some of these correlations (1). That would make it really bad at picking out the rare individuals who don't fit a trend (and who tend to be very important for identifying disease loci). GATK definitely does not know about correlated SNPs.

(1) The paper implies this is not the case, saying that DeepVariant works for other genomes without retraining, but they don't show the relevant results.