This is very cool. Are there by chance any associated projects that could evolve into something like 23andme but remain entirely within a private network meaning that the data is entirely in the hands of the individual?
yes. if you wanted to annotate your genome you could “easily” do it on your brand new macbook (this is ram intensive, you probably need 32G). you’d need a reference genome, like https://www.nist.gov/programs-projects/genome-bottle
you’d likely have to get the nanopore sequencer in the article or find a lab using Next Generation Sequencing to sequence your DNA and give you “raw data” which are usually fastq files
Could you please explain how this mapping works? Why it needs so much RAM? Is it doing a fuzzy search of sorts for known sequences (genes)? Why can't it do so one by one?
bwa specifically performs a burrows wheeler transform of a 3GB string. other mapping algorithms usually rely on some sort of indexing of the genome. the program then loads this into memory and queries that index for each “read” (a dna fragment from the dna sequencer).
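To make the index-and-query idea concrete, here is a toy sketch of a Burrows-Wheeler index with backward search. It is not how bwa is implemented: the function names `bwt_index` and `find` are made up for illustration, the suffix sort is naive (fine for a few hundred bases, hopeless for 3 GB), and it only finds exact matches, whereas real mappers also tolerate mismatches and gaps.

```python
from collections import Counter

def bwt_index(text):
    """Build a naive suffix array and BWT for a small text (toy sizes only)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    return bwt, sa

def find(bwt, sa, pattern):
    """Return sorted start positions of exact matches, via backward search."""
    # first[c] = number of characters in the text that sort before c
    counts = Counter(bwt)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt)  # current suffix-array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first:
            return []
        # Real indexes precompute these counts; here we count naively.
        lo = first[c] + bwt[:lo].count(c)
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])

genome = "ACGTACGTTAGC"
bwt, sa = bwt_index(genome)
print(find(bwt, sa, "ACGT"))
```

The point of the structure is that each read is looked up against one shared index held in memory, which is why the whole index has to be resident rather than streamed piece by piece.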
Human DNA contains roughly 3.2 billion nucleotides. A 3 GB string suggests an encoding with one byte per nucleotide.
I'm curious: since there are only 4 bases in DNA, for genomic data, this seems rather inefficient. Is there any advantage in encoding the DNA with two bits per nucleotide?
It's very common to use 2 bits per nucleotide despite the human genome having ambiguous bases on top of the 4 letters. These tools typically encode each ambiguous base as a random nucleotide but keep track of where they are, so the ambiguity can be accounted for later.
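A minimal sketch of that packing scheme, assuming a simple layout of four bases per byte and a side list of ambiguous positions. The names `pack` and `unpack` and the bit layout are illustrative assumptions, not the on-disk format of any particular tool.

```python
import random

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq, rng=None):
    """Pack a DNA string at 2 bits per base.

    Returns (packed bytes, sequence length, positions of ambiguous bases).
    """
    rng = rng or random.Random(0)
    packed = bytearray((len(seq) + 3) // 4)
    ambiguous = []
    for i, base in enumerate(seq.upper()):
        code = CODE.get(base)
        if code is None:              # N or any other ambiguity code
            ambiguous.append(i)       # remember where it was
            code = rng.randrange(4)   # store a random stand-in base
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed), len(seq), ambiguous

def unpack(packed, length, ambiguous=()):
    """Decode the packed bytes, writing N at the recorded ambiguous positions."""
    masked = set(ambiguous)
    out = []
    for i in range(length):
        code = (packed[i // 4] >> (2 * (i % 4))) & 3
        out.append("N" if i in masked else BASES[code])
    return "".join(out)

packed, n, amb = pack("ACGTNNACGTA")
print(len(packed), amb, unpack(packed, n, amb))
```

Note that this sketch collapses every ambiguity code to N on the way back. In terms of size, 3.2 billion bases at 2 bits each is about 800 MB, versus about 3.2 GB at one byte per base.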
In practice BWT alignment based tools may use a forward-index and a mirror-index of the reversed genome string (not reverse complemented). This dual index approach is important for dealing with mismatched strings. There's a nice example explaining this for an older tool named Bowtie [2]
With a two bit encoding and both indices it isn't uncommon for a genome index to take up several GB of RAM. For example, BWA uses 2-3 GB for its index [3].
That's not true. I just did a high-quality sequence and assembly of a new species of fungus from my home lab using nanopore. You can see all my code used for assembly and analysis that will be referenced in a paper I plan to publish in Jan here: https://github.com/EverymanBio/pestalotiopsis
Given that the decoder is machine-learned and depends on a training set to go from squiggle -> ATGC..., how do you ensure that sequences which haven't been seen before (not in the training set) are still accurately accounted for?
We used Guppy for basecalling, which is neural network based and used to turn raw signal data into the predicted bases. There are no guarantees of accuracy, only tools to determine and assess quality. One major way of assessing accuracy is to compare the subject genome with other similar reference genomes and note the high degree of homology in highly-conserved regions.
My question is if in the future, we would be able to fully rely on translations to predicted bases for sequencing or if there would always be a need to compare with a different sequencing methodology in the case of de novo genetic information that previously hasn't been seen before (no reference genomes being available in that case).
Is there publicly available information on how accurate Guppy is, as well as how the amount of training data scales with improvements in accuracy?
It didn't seem like these things were mentioned explicitly in the Community Update, other than that it’s expected to continue improving, but a clearer roadmap would definitely be much more helpful.
There are quality checks throughout the entire process, starting from the raw read quality scores returned directly from the sequencer all the way to fully assembled genome completeness. In our paper, one of the tools we used for this is called BUSCO[0] which scored our assembly at 97.9%, a relatively high score for de novo assemblies.
Interested outsider here; I work with a lot of HCLS research customers but don't have a biology-related background. Can you explain the problems with the Nanopore sequencer accuracy in more detail? Basically, I was wondering if I could get one for myself and sequence my own genome, then use the data to learn about life-sciences computing techniques. If I were to buy one of the USB-attachable devices and run it, is the data simply not viable for use in a genomics pipeline, or is it just that the results would be questionable? Also, if accuracy is an issue, what about just running the same sample N times and doing some error correction?
I guess there are limits to ensemble methods if the underlying accuracy doesn't increase. I don't work on gene sequencing algorithms but from what I understand of ML ensemble techniques, there are certain assumptions regarding the underlying independence of the errors. The errors for nanopore should be uniform but I am not sure. Any molecular biologist here care to comment?
I know that the error rate of the oxford nanopore sequencer depends on GC content (guanine/cytosine nucleotides), and that the Pacific Biosciences sequencer uses a polymerase that gets worn down during reading. So there is some non-uniformity in the chemistry.
The instruments do exactly as you say (run the sample N times), but this obviously comes at a cost. Also, keep in mind that sequencing needs to be very, very accurate to be useful. We share most of our DNA, and the small variations make up all the difference.
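The independence assumption raised above can be put into numbers with a small sketch. It assumes every read covering a position is wrong with the same probability `p`, that errors are independent, and (pessimistically) that all wrong reads agree on the same wrong base. The function name `majority_error` is made up for illustration.

```python
from math import comb

def majority_error(p, n):
    """Probability that more than half of n independent reads are wrong
    at one position, given per-read error probability p (n must be odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

for n in (1, 3, 5, 11):
    print(n, majority_error(0.1, n))
```

Under these assumptions the consensus error drops quickly with depth. Systematic errors break the independence assumption, because every read tends to be wrong in the same place, so real gains are smaller than this model suggests.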
Yes, those are all relevant costs. There's also a tradeoff between accuracy and the number of reads (how many sequences you can observe), or how much data you can get out of the machine.
Tl;Dr: Nanopore data is historically lower quality than current gold-standard methods, but it is by no means "not viable" in a genomics pipeline. Their newer chemistry flowcells are competitive with current gold-standard (but I've not seen it with my own eyes in the lab yet due to limited release).
There are two components that drive sequencing error rate. 1) The chemistry behind the sequencing (for nanopore sequencing this is the "feeding DNA through a pore" bit) 2) the method to convert raw signal into DNA sequence (this is called "base calling").
The gold-standard in terms of error profile for sequencing is currently the Illumina short read platform. Illumina machines are really just microscopes (TIRF scopes for optics folks) that sequence DNA by visualizing incorporation of dye-labeled nucleotides into the sequenced molecule(s) (Imagine a really slow PCR [1]). Each base is labeled with a different color, then when a molecule has a match it makes a colored spot on the slide that the machine can read (see here for more info & details of newer chemistry that use fewer colors [2]). This whole process is mediated by DNA polymerase which itself has a very low error rate. Another important point is that DNA sequenced on the illumina platform (called a "library") tends to be from "amplified" template DNA, meaning the DNA will have been processed and potentially be missing chemical modifications on the bases that could be present in the organism. This works to Illumina's advantage, because when trying to answer the question of "what is the DNA sequence?" we want the ground-truth DNA, not the modification state.
In contrast, Nanopore sequencing works by feeding a long strand of DNA through a pore and measuring the change in electrical current through the pore (watch the cool video [3]). For the current set of nanopore flowcells, 8 bases of DNA sit in the pore at a time, meaning the current at each timestep is a product of 8 nucleotides in aggregate. This also means that the pore "sees" each base 8 times, but always in the context of an additional 7. In order to basecall from the raw signal, it's not as easy as saying "blue = A", instead, you have to deconvolve each base from a complex signal. As you might imagine, the folks at Oxford Nanopore & broader research community have turned to machine learning-based base callers to solve this problem, and they work quite well [4]. But they are not perfect.
Deconvolving runs of the same base (e.g. "AAAAAAA") is difficult because without well-defined signal changes between bases, the caller has a hard time deciding how many bases it has seen, so a common error mode for nanopore sequencing is to create insertions/deletions at places in the genome with low nucleotide diversity. Another interesting source of error is that most Nanopore library preps are performed on unamplified DNA, and so in addition to normal A/T/G/C nucleotides, the template DNA can also contain bases with chemical modifications. For example, in bacteria, A's are often methylated, and in humans, C can have all kinds of different modifications (5-methyl-cytosine, 5-hydroxymethyl-cytosine, etc.) and each different modification affects the signal in the nanopore. Therefore, basecallers that weren't trained on modified bases will produce basecalling errors in the presence of base modifications.
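To see where this first error mode bites, here is a small sketch that locates homopolymer runs in a sequence and also collapses them, which is a trick some long-read tools use to sidestep run-length errors. The function names and the `min_len` threshold are illustrative assumptions.

```python
from itertools import groupby

def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for each run of one base >= min_len long."""
    runs, pos = [], 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs

def compress_homopolymers(seq):
    """Collapse every run of identical bases to a single base."""
    return "".join(base for base, _ in groupby(seq))

truth = "ACGAAAAAAATTGGGGC"
read = "ACGAAAAAATTGGGGC"   # same sequence with one A dropped from the run
print(homopolymer_runs(truth))
print(compress_homopolymers(truth) == compress_homopolymers(read))
```

The two sequences differ by a deletion inside the A run, yet they are identical once runs are collapsed, which is exactly the kind of error the paragraph above describes.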
Both Illumina and Nanopore basecallers assign a quality score to each base that indicates the probability that the basecaller produced an incorrect value. This is called a Q-score, which is defined as "Q = -10 * log10(P)", where P is the estimated probability that the call is wrong (i.e. Q / 10 = the order of magnitude of the error probability) [5]. For example, a Q-score of 10 means an error rate of 1 in 10, but a Q-score of 50 means an error rate of 1 in 100,000. For Illumina sequencing, >95% of the reads have a Q-score > 30 (i.e. 1 in 1000 errors), while Nanopore reads tend to have lower average Q-scores (~Q20, i.e. 1 in 100 errors). For genetics, where 1 base difference can mean the difference between a severe disease allele vs a normal variant, 1 in 100 won't cut it.
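The Q-score arithmetic is easy to play with directly. The sketch below converts between Q and error probability and decodes a FASTQ quality line, assuming the common Phred+33 ASCII encoding. The function names are made up for illustration.

```python
import math

def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)

def fastq_qualities(quality_line, offset=33):
    """Decode a FASTQ quality string into per-base Q-scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]

for q in (10, 20, 30, 50):
    print(q, phred_to_error(q))
print(fastq_qualities("I5+"))
```

One practical consequence of the log scale: to average quality over a read, average the error probabilities and convert back, rather than averaging the Q-scores themselves.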
The current gen Nanopore flowcell chemistry (R9.4.1) is what most people are talking about when they talk about Nanopore error rates, but they've just released a new pore type & made some basecaller upgrades that improve the accuracy to what they call "Q20+" and some claims of Q>30, and from the data I've seen, it's impressive, I just haven't got my hands on one yet to see for myself [6]. I think the comment saying "wait 5 years" is an overestimate, but if you want to genotype yourself today, I'd just pay someone for Illumina sequencing and process the fastq files yourself if you really want to do it as a learning exercise.
I've unintentionally written an essay, so I'll stop here, but real quick to your other point RE: rerunning the sample N times & using the repeats for error correction. This won't work the way you're thinking because a "sample" is actually a collection of DNA molecules that are sampled randomly by the sequencer. You have no way of knowing that the same read between runs was actually from the same molecule, so you can't error correct this way. Interestingly, a totally different sequencing platform from Pacific Biosciences uses this strategy by doing some really cool chemistry, but I'll spare you the second essay (google "PacBio HiFi" or "circular consensus reads" if you're interested).
I for one am glad you wrote the essay, this was incredibly informative and filled in a bunch of blanks I had after reading what I could scratch together on the MinION product. I think I'm in a partial state of shock at how accessible this is becoming. Thank you!
Thanks - fascinating stuff. I'm now even more convinced I want to give it a try, but I think I'll play around with public data and tutorials before leaping into home sequencing.
You totally should, it's a lot of fun. I'd suggest trying to find some bacterial genome sequencing (like E. coli) done on nanopore if you're interested in those data. I don't have a link to any handy right now, otherwise I'd post here, but assembling bacterial genomes is shockingly easy these days and doesn't need nearly as many resources as doing a human genome, so it's great for learning (I love the assembler Flye [1] for this).
And RE: home sequencing, honestly the hardest part for a beginner will likely be the sample prep, since that takes some combination of wet lab experience and expensive equipment. I really wish molecular biology was as simple to get hacking on as writing software. The lag time between doing an experiment and getting a result is so much longer than waiting for things to compile, it just makes improving your skills take longer.
yup. that’s the business model for Illumina. it’s very much akin to video game consoles. Illumina might take a hit on selling the machine but makes it up in selling you proprietary reagents.
What sort of books/videos do you suggest so one can learn more? This stuff is interesting, and I've always seen inexpensive lab equipment on ebay.
If this can sequence flora, fungi and human DNA for about 10k - I'd buy it, just to experiment and deep dive. That is such a low barrier to entry that it is interesting in itself.
Cost/benefit analysis may dictate that, as other posters suggested, you'd be better served to get raw fastq files from a sequencing lab. Even better if you can send the lab a sample and they'll process the extractions for extra $$.
> and i feel like nanopore is the VR of dna sequencing. it’s always just another few years off.
Is this also true for nanopores in protein sequencing? This HN comment from a few weeks back [1] pointed out recent progress but perhaps the tech is still not quite there.
What do you mean by it's always a few years off? Nanopore will allow you to do high-quality genomic sequencing _now_, in a home lab if you wanted, for less than $3K. If you amortize the 3K by the number of genomes you can sequence on the same flow cell, the price per base or per genome falls precipitously, depending on the size of the genome of course.
ya my first thought was how hard are reagents to get, but probably not that hard. i wasn’t in the lab, i was in bioinformatics so i’m generally clueless on reagent acquisition.
Oh God, I would not want a distributed group of actors with limited trust to sequence my DNA. Maybe it's a project for a close group of friends that would be interested?
I wasn't thinking sequencing but rather comparison. Could even hash data for comparison to enforce privacy (unsure how effective that would be)
But this could enable things like finding relatives which is what I got out of the comment about 23andme. Instead of all the data being centralized, storage and comparison could be distributed
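As a rough sketch of what hashed comparison might look like: each party hashes its variant calls with a keyed hash and the two sides compare only the digests. Everything here is a hypothetical illustration (the `chrom:pos:ref>alt` key format, the shared secret, and using set overlap as a crude similarity score), not an established protocol.

```python
import hashlib
import hmac

def variant_key(chrom, pos, ref, alt):
    """Canonical text form of one variant (format is an assumption)."""
    return f"{chrom}:{pos}:{ref}>{alt}"

def hashed_variants(variants, shared_secret):
    """HMAC-SHA256 digest of every variant, keyed by a secret both parties hold."""
    return {
        hmac.new(shared_secret, variant_key(*v).encode(), hashlib.sha256).hexdigest()
        for v in variants
    }

def overlap(a, b):
    """Jaccard similarity between two sets of digests."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up variants, for illustration only.
mine = [("chr1", 100, "A", "G"), ("chr2", 200, "C", "T"), ("chr3", 300, "G", "A")]
theirs = [("chr2", 200, "C", "T"), ("chr3", 300, "G", "A"), ("chr4", 400, "T", "C")]
secret = b"example-shared-secret"
print(overlap(hashed_variants(mine, secret), hashed_variants(theirs, secret)))
```

On the "unsure how effective" point: known variants are a small, enumerable set, so anyone who holds the secret can hash every known variant and reverse the digests. Hashing alone hides the data from outsiders, not from the party you compare with.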
Your DNA is almost exactly the same as other people's, just a unique mix.
Music is exactly the same notes, just a unique mix. So why is Sony upset that I want to stream their entire library? But jokes aside...
A few decades ago I fought the military on collecting my DNA. I stalled them long enough to get my honorable discharge and avoid that altogether. It's funny you ask because the commander asked the same thing and joked "Are you afraid we are going to clone you?!" to which I replied, "No sir, you should be afraid you are going to clone me." and we both had a laugh because he knew I was right. The military are not fond of critical/free thinkers. One of me was plenty. I explained that insurance companies were already using this data to retroactively cancel people's policies even if they were not actively afflicted by something. The commander showed me how to use the FOIA request system.
Laws have evolved a little since then but there are plenty of other risks. For starters, I can't easily change my DNA like I can change my debit card. That data can be used to tie me to others, or to assign guilt by association, which is undesirable drama. It can also be used to try to sell me things. It can also be used to target biological weapons against specific groups of people. There appears to be an imbalance of data sharing in this regard. [1] Then there is simply the matter of privacy. If I want to share my DNA with some lab that is in turn going to sell it out to hundreds of other companies over and over forever, I should at the very least be getting paid a vast amount of money and land and have legally binding contracts and NDAs that cover what is and is not allowed to be done with my data and how long it may be retained. That contract and the laws enforcing the contract must have some serious teeth with very serious ramifications for anyone violating it whether intentionally or by mistake.
Congrats on your navigation of the military DNA collection.
I'm more curious what the actual threat might look like.
The marginal utility of your particular genome is minuscule. Without deep phenotypic information from biophysical parameters, it is utterly impossible to learn something novel from any single genome. This makes the marginal value of the genome information very low, both to you and any attacker or user. You would not be paid much for your data even if it was sold over and over because the rates are like those for plays on Spotify.
There are not fixed differences between human populations, and there are dramatic pressures to balancing selection that keep diversity focused in key genomic regions that are critical for immune response. This is to say that it would be damn hard to target any single group with a bioweapon. And if you wanted to target a single individual with a genomically targeted bioweapon, you also have physical access, making the problem of getting genomic information without consent trivial.
People often talk about insurance risk. I suppose that's an attack vector. It's also one that can be regulated with laws and social norms. Fwiw I wonder how often this is primarily an American concern.
Imagine a public genome data repository. People donate their genomes to science and post them there for the world to use and learn from. In my opinion, it would be better for an individual to share their data than not. The reasoning is that no matter what is done with the data, the net effect will be that society learns more about the individual's particular genome than those of people who haven't contributed. This will yield better adaptation of the society to the individual. Literally this might mean that a treatment for something affecting the individual is slightly better. In expectation, the worst thing that can happen is that the individual gains more information about themselves.
I'm curious about the possible abuse scenarios given the ubiquitous use of PCR-testing for nearly two years, now.
If I'm informed correctly, for a viable sample for NGS you need something like 2 mL of saliva (which sounds like little but really takes some time to collect: >1 min), not the trace amounts that usually get collected by the swabs?
A very practical reason not to want your DNA out there, unrestricted, is insurance costs: car insurance, health insurance, mortgage lending rates, and life insurance. While GINA from 2008 is supposed to protect that information, there are loopholes in the interpretation of that law that should give everybody pause.
Using that analogy, all the 1s and 0s in your private key are the same as everyone else's as well. Genetic data can be used for all kinds of things, the worst of which would be things like targeted diseases or planting your DNA at a crime scene.
Actually it's like your private key is made up of ~1000 1 mb pieces that each have 1/1000 rate of difference with any other similar piece. Oh, and the order of the pieces is almost always exactly the same.
No, genomes are not "almost the same" because they are all in base-4 sequences and thus made up of the same 0s 1s 2s and 3s.
We are astoundingly similar, even unusually so for a large mammalian species.
then you’d need a program like bwa http://bio-bwa.sourceforge.net/ to map your data.
then use https://samtools.github.io/bcftools/howtos/variant-calling.h... or something else to produce variants from the mapping results.
then compare your resultant vcf file to something like dbSNP: https://www.ncbi.nlm.nih.gov/snp/
at this point you can start generating a raw version of a 23andMe report.
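The steps above can be sketched end to end. The first function only builds the shell commands (it does not run them), using the standard `bwa index`, `bwa mem`, `samtools sort`, and `bcftools mpileup` / `bcftools call` invocations. The file names are placeholders. The second part is a hypothetical, simplified way to look up your calls in a known-variant VCF such as one downloaded from dbSNP. It assumes uncompressed VCF text and that both files use the same contig names and reference build, which in practice you would need to check.

```python
def pipeline_commands(reference, reads, prefix):
    """Return shell commands for a minimal map-and-call run (not executed)."""
    bam = f"{prefix}.sorted.bam"
    vcf = f"{prefix}.vcf.gz"
    return [
        f"bwa index {reference}",
        f"bwa mem {reference} {reads} | samtools sort -o {bam} -",
        f"bcftools mpileup -f {reference} {bam} | bcftools call -mv -Oz -o {vcf}",
    ]

def read_vcf_variants(lines):
    """Yield (chrom, pos, ref, alt, id) from uncompressed VCF text lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip headers and blank lines
        chrom, pos, vid, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):  # one record per alternate allele
            yield chrom, int(pos), ref, alt, vid

def annotate(sample_lines, known_lines):
    """Pair each sample variant with the ID of a matching known variant, if any."""
    known = {(c, p, r, a): vid for c, p, r, a, vid in read_vcf_variants(known_lines)}
    return [
        ((c, p, r, a), known.get((c, p, r, a)))
        for c, p, r, a, _ in read_vcf_variants(sample_lines)
    ]

# Made-up records; "rs_example1" is a placeholder, not a real dbSNP ID.
sample = ["##fileformat=VCFv4.2", "chr1\t100\t.\tA\tG", "chr1\t200\t.\tC\tT,G"]
known = ["chr1\t100\trs_example1\tA\tG"]
for cmd in pipeline_commands("ref.fa", "reads.fastq", "me"):
    print(cmd)
print(annotate(sample, known))
```

Variants that come back with an ID are the ones you can start looking up; the unmatched ones are either rare, novel, or calling errors.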