by comstock 3064 days ago
It’s also expensive compared to other platforms (you can get a full human genome sequenced at high coverage for between 1000 and 3000 USD).

The error rate is stupidly high (somewhere between 10 and 20%) compared to Illumina or Ion Torrent, which give error rates far below 1%.

It can give very long reads, which are useful in some niche applications. But it’s been massively over-hyped (and over capitalized).

The neat thing is that it’s very small. But that isn’t really compelling given the very low accuracy.


> you can get a full human genome sequenced at high coverage for between 1000 and 3000 USD

This is not a "full" human genome, but a collection of 150bp fragments that can be realigned to an existing human genome. You cannot take this and infer the whole diploid genome of the individual. There is a huge amount that will be missed, and all of our current knowledge is based on this gappy picture of what's going on in single genomes and human populations.

> It can give very long reads, which are useful in some niche applications. But it’s been massively over-hyped (and over capitalized).

I think you're dismissing the technology out of hand because of biases derived from much more limited short-read technology that only allows us to reliably see small variants <50bp.

Without these long reads we can't see structural variation (SVs). There is an increasing amount of evidence that much of adaptive variation is driven by these kinds of variants. If you want recent evidence, see https://www.nature.com/articles/s41588-017-0010-y. There has long been evidence that there are huge copy number variations in humans, but these are still not evaluated reliably: http://science.sciencemag.org/content/330/6004/641.

We should be open to the possibility that our observational techniques are limiting our understanding of how genomes work. This has consistently occurred in the history of every observationally-driven science.

It's amusing to me that people assume that SVs are "niche" when even the limited surveys of genomes we've been able to do with short reads show that roughly an equal number of base pairs in the human population vary due to small variants like SNPs and indels and big ones like deletions, insertions, and large scale copy number variation: http://science.sciencemag.org/content/330/6004/641

I think you are not sufficiently recognising how much structural variation can be resolved from short reads. There is certainly some that can't but a large proportion can be with the right tools.
I've participated in several large projects that worked to detect SVs from short reads in humans. The results, which remain best of class, are simply disappointing. A tiny fraction of the variants detected were actually resolvable to near-base pair resolution. The vast majority were described in approximate terms, using estimates of breakpoints and allelic structure.

Most structural variation I've seen based on whole genome assemblies is not even classifiable into neat categories like "deletion" or "insertion". If you think that "most" things are detected with short reads then you are deluded by the dominant technology.

"No human genome has ever been completely sequenced" https://news.ycombinator.com/item?id=15534325 There are a lot of gaps where the amount of repetition makes it impossible to reassemble a complete picture from tiny fragments.
I’d agree with you that long reads would be useful if the error rate weren’t so shockingly bad.

There is likely value in long reads, but what non-niche research applications are there for highly error-prone reads that justify a valuation of several billion dollars?

Virtually all applications can benefit from long reads. There are already hybrid assemblers out there which take Illumina, PacBio and Nanopore reads. The long reads tie the short reads together, whereas the short reads improve the accuracy.

The area where DNA sequencing will first revolutionize clinical practice is in sequencing pathogens for the sake of identification. In these instances nanopore sequencing rules, because it can give answers in minutes.

Most clinical applications don’t need long reads. Pathogen identification from short reads is easy. Blood tests for cancer, and NIPT (which will likely be the first big applications) both use fragmented DNA in the blood, so long reads are not useful. Depth (lots of sequencing) and quality are far more important.
It's worth noting that those clinical applications were developed when technology didn't allow long reads, so "clinical applications don't need long reads" is at present a truism. There may be potential applications that require long reads that simply couldn't have been invented yet (though I haven't the slightest idea what those would be).
Yes, but I would say quality is most important in almost all cases. Well, quality being defined as <1% error rate, which isn’t such a high bar.

The most compelling near-term applications (NIPT etc.) use fragmented DNA, and long reads will have no benefit here.

So, yes. Long reads are useful, but you need to have at least reasonable performance in other respects. The same thing has been seen with PacBio, which has not done well in the market despite having a read length advantage.

How long does it take to get the answer? Even if a big, expensive short read sequencing machine is in the building, it still takes a day or two to reach the necessary data.

With sepsis, every hour counts.

The per-base error rate is bad. In the case of PacBio, this error process approximates white noise, and so you can deal with it perfectly by increasing read coverage. Things are somewhat complicated with the nanopore tech described in this post, as errors may be correlated due to the way the basecalling is done, but in practice it's not nearly as big a problem as you think it is.

For things approaching a read length the per-base error rate of a single read is simply irrelevant. In practice, with sufficient coverage (e.g. 20x) you simply don't care about the per base error rate of the reads.
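To put a rough number on the coverage argument, here is a minimal sketch. It assumes errors are independent between reads (the white-noise case above, not the correlated case) and treats every error as a vote for the same wrong base, which is pessimistic. The function name `majority_vote_error` is made up for illustration, and real consensus callers are more sophisticated than a plain majority vote.

```python
from math import comb


def majority_vote_error(per_read_error: float, coverage: int) -> float:
    """Upper bound on the chance that a majority vote at one position is wrong.

    Assumes each read is wrong at this position independently with
    probability `per_read_error`. Ties are counted as wrong calls.
    """
    if coverage < 1:
        raise ValueError("coverage must be at least 1")
    e = per_read_error
    # Smallest number of erroneous reads k such that 2 * k >= coverage.
    k_min = (coverage + 1) // 2
    # Binomial tail: probability that at least k_min of the reads are wrong.
    return sum(
        comb(coverage, k) * e**k * (1 - e) ** (coverage - k)
        for k in range(k_min, coverage + 1)
    )


if __name__ == "__main__":
    for cov in (1, 5, 20, 50):
        print(cov, majority_vote_error(0.15, cov))
```

Under these assumptions, a 15% per-read error rate at 20x gives a per-position consensus error below one in a thousand, and it keeps falling as coverage grows. Correlated or systematic errors would not average out this way.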

That might be the case if the throughput wasn’t so low and the error rate wasn’t so high.
I feel like there is or was an assumption they'd be able to improve their tech
It's fine for bacteria where even with 10-20% error, you can do consensus calling with very high (200-500X) coverage.
> The error rate is stupidly high (somewhere between 10 and 20%)

The insertion/deletion error rate is 20-30%.

The point mutation error rate is something like 0.1-1% (higher than HiSeq but not crazy high).

This means with a semi-decent reference genome you should be able to do re-sequencing fairly accurately. It also means that in conjunction with HiSeq reads you can do cheap genome assembly, using the HiSeq reads for coverage and the MinION reads for scaffolding.

Do you have a citation for this? Because this is not my understanding.
I can add to this with my anecdotal evidence. In my experience (looked at ~15gb of basecalled data total), there are a large number of indel errors.

The mismatch rate is much lower. But it's hard to calculate exactly the mismatch rate when the indel rate is so high.
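As a rough illustration of how the two rates can be separated, here is a minimal sketch that counts mismatches and indels from a single pairwise alignment given as two gapped strings of equal length, with `-` marking a gap. The function name and the toy sequences are hypothetical, not taken from any real run. The split depends on where the aligner chose to place gaps, which is part of why the mismatch rate is hard to pin down when indels are frequent.

```python
def alignment_error_rates(ref_aln: str, read_aln: str) -> dict:
    """Count mismatch and indel rates from one gapped pairwise alignment.

    `ref_aln` and `read_aln` are the two rows of the alignment, with '-'
    for gaps. The rates reflect this particular gap placement only.
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("alignment rows must have the same length")

    matches = mismatches = insertions = deletions = 0
    for ref_base, read_base in zip(ref_aln, read_aln):
        if ref_base == "-" and read_base == "-":
            continue  # empty column, carries no information
        if ref_base == "-":
            insertions += 1  # base present in the read only
        elif read_base == "-":
            deletions += 1  # base present in the reference only
        elif ref_base == read_base:
            matches += 1
        else:
            mismatches += 1

    aligned = matches + mismatches  # columns where both rows have a base
    columns = aligned + insertions + deletions
    return {
        "mismatch_rate": mismatches / aligned if aligned else 0.0,
        "indel_rate": (insertions + deletions) / columns if columns else 0.0,
    }


if __name__ == "__main__":
    # Toy alignment: one mismatch, one insertion, one deletion.
    print(alignment_error_rates("ACGT-ACGTAC", "ACCTTAC-TAC"))
```

Here the mismatch rate is computed over columns where both sequences have a base, and the indel rate over all alignment columns. Other denominators (read length, reference length) are equally defensible and give different numbers.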

No citation, just personal experience aligning the data and comparing it directly to a HiSeq run of the same DNA.

I was able to get about a 1-10% mutation rate, with a median of about 1.5%, depending on the quality of the run. In general it was on par with PacBio.