by alextheparrot 3434 days ago
I can address some of these.

Sequencing today is done mostly using computational methods. Think of DNA as a couple of long strings (the number of bases is effectively the character count of those strings, and each string is a "chromosome" in higher organisms), so the problem is how we read these long, physical strings. It turns out that parallel processing is way more effective, so we break the really long strings into much, much smaller strings that overlap (from millions of characters long down to hundreds, often). Because the strings overlap, we can reconstruct a good portion of the actual sequence computationally by exploiting the overlaps between our small strings.
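To make the overlap idea concrete, here is a minimal sketch of a greedy overlap merge. It is a toy, not how production assemblers work: the function names, the `min_len` threshold, and the example reads are all made up for illustration, and it assumes error-free reads from a sequence with no repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the largest overlap.

    Returns the remaining strings (contigs) once no overlaps are left.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more; leave separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping "short strings" cut from GATTACAGGCTCCATG, in arbitrary order.
print(greedy_assemble(["GCTCCATG", "GATTACAG", "ACAGGCTC"]))
```

The pairwise comparison is quadratic in the number of reads, which is fine for three strings and hopeless for billions; it is only meant to show why overlapping pieces are enough to put the string back together.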

The physical way they do this is by using machines (Think GPU vs CPU) that are effectively a bunch of parallel microscopes specialized to read those short strings and by "attaching" colors to each of the characters (DNA bases). Initial DNA sequencing methods lacked both the computational and physical devices to do this, so they were done by hand. The move from doing sequencing by hand to doing it computationally is why we see the significant increase in characters read (Number of bases).

Your last comment I think is the most interesting, as it effectively asks "Why do mice have a larger string size than us, which means they contain more information on an absolute level?". The answer is just because. The number of bases, or even the number of blocks of information that produce proteins (these blocks are called genes, and a protein is another chemical construct that mainly focuses on doing actions in the cell), is not strongly correlated with the complexity of the organism. The key is how those bases interact, not necessarily how many there are.

If you have any more questions or need some clarification, I'd love to address them. It is a wonderful time to be alive.

3 comments

Thanks for volunteering to answer some questions.

1. What is the (maximum) range of read lengths that modern gene sequencers can produce? Any timeline on when those read lengths will increase substantially?

2. How do bioinformatics people contend with repetitive genomic regions?

3. Are there any differences in how gene sequencing technology works on DNA from different species? For example, does an approach that works on humans (e.g. gene sequence alignment or de novo assembly) work on something like wheat?

1. Depends on the technology. On Illumina (cheapest tech and highest throughput), you get the first and last 125 bases of smallish DNA molecules with an acceptable error rate. Pacific Biosciences (lower throughput and more expensive) gives you up to 40,000 bases with a rather horrible error rate.

2. They fail epically. There is nothing you can do computationally. With paired end reads (two reads at an approximately known distance), you still can't assemble repetitive regions, but you can get the contigs around the repeat in the right order.

3. Definitely, but I don't know the details. Plants are often more difficult than animals; they have bigger genomes and often have multiple chromosome sets. Assembly of a wheat genome is more difficult than assembly of the human genome---and I'd argue even the latter isn't actually a solved problem.
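On point 2, a small sketch of why repeats defeat assembly when reads are shorter than the repeat. The two toy genomes below are invented for illustration: they differ only in the order of the unique bits between three copies of the repeat `CCCC`, yet every read of length 5 you could draw from one can also be drawn from the other, so no algorithm can tell them apart from those reads alone.

```python
from collections import Counter


def all_reads(genome: str, read_len: int) -> Counter:
    """Multiset of every substring of length read_len (idealised, error-free reads)."""
    return Counter(genome[i:i + read_len] for i in range(len(genome) - read_len + 1))


# Hypothetical genomes: the G and T between the repeat copies are swapped.
genome_1 = "A" + "CCCC" + "G" + "CCCC" + "T" + "CCCC" + "A"
genome_2 = "A" + "CCCC" + "T" + "CCCC" + "G" + "CCCC" + "A"

print(genome_1 != genome_2)                               # True: different genomes
print(all_reads(genome_1, 5) == all_reads(genome_2, 5))   # True: identical reads
print(all_reads(genome_1, 6) == all_reads(genome_2, 6))   # False: reads now span the repeat
```

Once a read is long enough to cover a whole repeat copy plus unique sequence on both sides (length 6 here), the ambiguity goes away, which is the appeal of longer reads.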

This is really interesting. For someone who knows nothing about the subject, how were DNA strands physically read at a low level before computational methods? I was under the impression DNA is too small to see without an electron microscope. You mention reading DNA by hand, and I'm really interested in how that is done.
Given a string, I can easily discern one characteristic, which is length. That's because the length of the string is tied to how "massive" it is and thus when I push on things that are more massive they move more slowly. That's the general idea behind gel separation.

Now, I just need a way to make all the combinations of substrings starting from the first position (0 => 1, 0 => 2, etc.). This is a bit more difficult to explain and chemically intensive, but let's assume for each character (C) we have another character (C') that is pretty much the same thing. The key difference, however, is that C' is marked (Radioactivity or with something that lights up) AND that it doesn't allow any more characters to be added on. If each distinct C' is a different color, we can now distinguish between our different substrings, based entirely on the last character. We know that our strings are ordered by size, so we can construct our original sequence based on the terminal member of the substrings.

You can imagine this process being done by hand; it works for that. However, it doesn't scale well to the millions and billions of base pairs we need in the modern day.
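Staying with the string analogy, the gel-plus-terminator idea can be sketched in a few lines. This is a simplification under stated assumptions: the function names are made up, every prefix length is assumed to be present exactly once, and the last character of each fragment stands in for the marked terminator C' (the real chemistry copies the complementary strand, which is ignored here).

```python
import random


def chain_terminated_fragments(template: str) -> list[str]:
    """Every prefix of the template; the final character plays the role of C'."""
    return [template[:n] for n in range(1, len(template) + 1)]


def read_gel(fragments: list[str]) -> str:
    """Order fragments shortest-first, as the gel does, and read each terminal label."""
    return "".join(fragment[-1] for fragment in sorted(fragments, key=len))


fragments = chain_terminated_fragments("GATTACA")
random.shuffle(fragments)      # the reaction mixture has no inherent order
print(read_gel(fragments))     # GATTACA
```

Sorting by length is the only thing the gel contributes, and the colour of the last character is the only thing we ever look at.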

As a fun aside, protein sequences were originally determined in a way pretty much the inverse of this. For a given protein string, remove the first element with chemistry. Then, try to figure out what you removed. Now take your string of size N - 1 and repeat, until you have determined each character. This method ended up not being tractable for DNA because of chemical differences. Also, a lot of protein sequencing is done in a similar way to DNA sequencing, in that we break up, shatter may be a better word, the protein. We then try to construct the original protein based on how it shatters (like reconstructing a window based on knowing where the pieces fall and where the baseball came from).

That depends on the technology. The technology I most often work with will have lots of fake DNA basepairs in a soup, which has the real subject's DNA broken apart into fragments and attached to a substrate to keep it from moving. The fake DNA basepairs bind to the complementary real basepairs and emit fluorescent light when doing so. En masse, the fluorescent light gets captured on camera and each of the possible basepairs' colors are scored, to provide a call for that basepair. Repeat the cycle a few times, map the lights to the same spots, and you generate a sequence of basepair 'reads' which are then sent further down a software pipeline for later analysis.
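The "score the colors, make a call" step for a single spot can be sketched roughly like this. The channel order, the intensity values, and the function name are assumptions for illustration only; real base callers also correct for things like channel cross-talk and signal decay and attach a quality score to each call.

```python
CHANNELS = "ACGT"  # assumed order of the four colour channels


def call_bases(cycles: list[tuple[float, float, float, float]]) -> str:
    """For each imaging cycle, call the base whose channel is brightest."""
    calls = []
    for intensities in cycles:
        brightest = max(range(len(CHANNELS)), key=lambda k: intensities[k])
        calls.append(CHANNELS[brightest])
    return "".join(calls)


# Made-up intensities for one spot over four cycles, in (A, C, G, T) order.
spot = [
    (0.1, 0.0, 0.9, 0.1),
    (0.8, 0.1, 0.1, 0.0),
    (0.0, 0.1, 0.2, 0.7),
    (0.1, 0.2, 0.0, 0.9),
]
print(call_bases(spot))  # GATT
```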

...the later steps in the pipeline involve using lots of complex math to reassemble the sequenced fragments back together, either using a reference assembly (such as the Human Genome Project) or else de-novo assembly (which basically _builds_ a reference assembly through lots and lots and lots of effort).
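For the reference-based route, the simplest possible version of "put the read back where it belongs" is to slide it along the reference and keep the placement with the fewest mismatches. This is a toy sketch with hypothetical names and a made-up reference string; real aligners use indexes and allow insertions and deletions rather than brute-forcing every position.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(reference: str, read: str) -> tuple[int, int]:
    """Return (position, mismatches) for the best ungapped placement of read."""
    positions = range(len(reference) - len(read) + 1)
    best = min(positions, key=lambda p: hamming(reference[p:p + len(read)], read))
    return best, hamming(reference[best:best + len(read)], read)


reference = "GATTACAGGCTCCATG"
print(map_read(reference, "ACAGGCTC"))  # (4, 0): exact match at position 4
print(map_read(reference, "CTCGATG"))   # (9, 1): one mismatch, e.g. a read error or a variant
```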

There are other technologies as well which I'm not so familiar with.

Look up Sanger sequencing.
Just wanted to say thanks for answering in such detail. Your string analogy really helped me understand it quite easily.