by eggie 3065 days ago
> you can get a full human genome sequenced at high coverage for between 1000 and 3000 USD

This is not a "full" human genome, but a collection of 150bp fragments that can be realigned to an existing human genome. You cannot take this and infer the whole diploid genome of the individual. There is a huge amount that will be missed, and all of our current knowledge is based on this gappy picture of what's going on in single genomes and human populations.

> It can give very long reads, which are useful in some niche applications. But it’s been massively over-hyped (and over capitalized).

I think you're dismissing the technology out of hand because of biases derived from much more limited short-read technology that only allows us to reliably see small variants <50bp.

Without these long reads we can't see structural variation (SVs). There is an increasing amount of evidence that much of adaptive variation is driven by these kinds of variants. If you want recent evidence, see https://www.nature.com/articles/s41588-017-0010-y. There has long been evidence that there are huge copy number variations in humans, but these are still not evaluated reliably: http://science.sciencemag.org/content/330/6004/641.

We should be open to the possibility that our observational techniques are limiting our understanding of how genomes work. This has consistently occurred in the history of every observationally-driven science.

It's amusing to me that people assume that SVs are "niche" when even the limited surveys of genomes we've been able to do with short reads show that roughly an equal number of base pairs in the human population vary due to small variants like SNPs and indels and big ones like deletions, insertions, and large scale copy number variation: http://science.sciencemag.org/content/330/6004/641


I think you are not sufficiently recognising how much structural variation can be resolved from short reads. There is certainly some that can't but a large proportion can be with the right tools.
I've participated in several large projects that worked to detect SVs from short reads in humans. The results, which remain best of class, are simply disappointing. A tiny fraction of the variants detected were actually resolvable to near-base pair resolution. The vast majority were described in approximate terms, using estimates of breakpoints and allelic structure.

Most structural variation I've seen based on whole genome assemblies is not even classifiable into neat categories like "deletion" or "insertion". If you think that "most" things are detected with short reads then you are deluded by the dominant technology.

"No human genome has ever been completely sequenced" https://news.ycombinator.com/item?id=15534325 There are a lot of gaps where the amount of repetition makes it impossible to reassemble a complete picture from tiny fragments.
I’d agree with you that long reads would be useful if the error rate weren’t so shockingly bad.

There is likely value in long reads, but what non-niche research applications are there for such error-prone reads that justify a valuation of several billion dollars?

Virtually all applications can benefit from long reads. There are already hybrid assemblers out there which take Illumina, PacBio and Nanopore reads. The long reads tie the short reads together, whereas the short reads improve the accuracy.

The area where DNA sequencing will first revolutionize clinical practice is in sequencing pathogens for the sake of identification. In these instances nanopore sequencing rules, because it can give answers in minutes.

Most clinical applications don’t need long reads. Pathogen identification from short reads is easy. Blood tests for cancer, and NIPT (which will likely be the first big applications) both use fragmented DNA in the blood, so long reads are not useful. Depth (lots of sequencing) and quality are far more important.
It's worth noting that those clinical applications were developed when technology didn't allow long reads, so "clinical applications don't need long reads" is at present a truism. There may be potential applications that require long reads that simply couldn't have been invented yet (albeit I haven't the slightest idea what those would be).
Yes, but I would say quality is most important in almost all cases. Well, quality being defined as <1% error rate, which isn’t such a high bar.

The most compelling near term applications (NIPT etc) use fragmented DNA, and long reads will have no benefit here.

So, yes. Long reads are useful, but you need to have at least reasonable performance in other respects. The same thing has been seen with PacBio, which has not fared well in the market despite having a read length advantage.

How long does it take to get the answer? Even if a big, expensive short read sequencing machine is in the building, it still takes a day or two to get the necessary data.

With sepsis, every hour counts.

The per base error rate is bad. In the case of PacBio, this error process approximates white noise, and so you can deal with it perfectly by increasing read coverage. Things are somewhat complicated with the nanopore tech described in this post, as errors may be correlated due to the way the basecalling is done, but in practice it's not nearly as big a problem as you think it is.

For things approaching a read length the per-base error rate of a single read is simply irrelevant. In practice, with sufficient coverage (e.g. 20x) you simply don't care about the per base error rate of the reads.
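
To put rough numbers on the coverage argument, here is a minimal Python sketch. It assumes errors are independent between reads (the white-noise case; as noted above, this may not hold for nanopore basecalling) and pessimistically treats every wrong read as voting for the same wrong base. The function name `majority_error` and the 15% per-read error rate are illustrative assumptions, not figures from any vendor or from this thread.

```python
from math import comb


def majority_error(per_read_error: float, coverage: int) -> float:
    """Chance that a simple majority vote at one site is wrong.

    Toy model (assumption): each of `coverage` reads is wrong
    independently with probability `per_read_error`, and the consensus
    is counted as wrong when at least half of the reads are wrong
    (ties are counted as errors).
    """
    p = per_read_error
    k_min = (coverage + 1) // 2  # smallest number of wrong reads that breaks the vote
    return sum(
        comb(coverage, k) * p**k * (1 - p) ** (coverage - k)
        for k in range(k_min, coverage + 1)
    )


if __name__ == "__main__":
    # Hypothetical 15% per-read error rate, at increasing coverage.
    for cov in (1, 5, 10, 20, 40):
        print(f"{cov:>3}x coverage: consensus error ~ {majority_error(0.15, cov):.2e}")
```

Under these assumptions the consensus error drops quickly as coverage grows. Correlated (systematic) errors would not average out this way, so read this as an illustration of the argument rather than a prediction for any real platform.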

That might be the case if the throughput wasn’t so low and the error rate wasn’t so high.
I feel like there is or was an assumption they'd be able to improve their tech