by mr_overalls 2167 days ago
From the article: "Repetitive DNA sequences are common throughout the genome and have always posed a challenge for sequencing because most technologies produce relatively short "reads" of the sequence, which then have to be pieced together like a jigsaw puzzle to assemble the genome. Repetitive sequences yield lots of short reads that look almost identical, like a large expanse of blue sky in a puzzle, with no clues to how the pieces fit together or how many repeats there are. . . . Filling in the remaining gaps in the human genome sequence opens up new regions of the genome where researchers can search for associations between sequence variations and disease and for other clues to important questions about human biology and evolution."

Or, to put this more simply: Finding the complete DNA sequence of chromosomes is difficult. That's because some parts of the sequence are highly repetitive. Using a new type of lab machine, the scientists were able to sequence the repetitive parts of the X chromosome. This gives a more complete picture of the X chromosome. And that can help scientists fight diseases and understand human biology better.

3 comments

So 20 years ago when we "sequenced the human genome", we actually didn't? If you'd asked me whether or not we had a complete sequence of an X chromosome before I saw this I would have said, "Of course we have one, for over 20 years".
So much about the original announcements was overhyped PR. The original assembly was super-crappy and super-gappy. The folks running the two projects were exhausted and declared victory, then moved on.

Of all the fields that I've worked in, genomics has been one of the most overhyped (virtual drug discovery is the other) and it takes a ton of training just to understand how messed up the field is.

Wow, this is news to me. What a farce haha. Thanks for adding clarity here and breaking a bad assumption I had!
That's right -- sequenced genomes are typically assemblies of short fragments. The assembly algorithms fail in areas of low complexity or when large sequences are repeated an unknown number of times.
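To make the repeat problem concrete, here is a minimal sketch (a toy model, not how any real assembler works). The two genomes, the flanking sequences, and the helper names are all made up for illustration, and reads are idealized as error-free substrings with no coverage-depth information. Two genomes that differ only in repeat count produce exactly the same set of short reads, so no algorithm could tell them apart from those reads alone. Reads long enough to span the repeat and reach unique sequence on both sides break the tie.

```python
def short_reads(genome: str, read_len: int) -> set[str]:
    """Every distinct substring of length read_len (idealized, error-free reads)."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}

# Hypothetical genomes: identical unique flanks, different repeat counts.
LEFT, RIGHT, UNIT = "ATTGCCGT", "TGACCTAA", "CAG"
genome_a = LEFT + UNIT * 10 + RIGHT   # 30-base repeat
genome_b = LEFT + UNIT * 25 + RIGHT   # 75-base repeat

def distinguishable(read_len: int) -> bool:
    """True if the two genomes yield different sets of reads at this read length."""
    return short_reads(genome_a, read_len) != short_reads(genome_b, read_len)

print(distinguishable(12))  # reads much shorter than the repeat
print(distinguishable(40))  # reads longer than the shorter repeat
```

In practice read depth gives a rough hint about copy number, which this sketch deliberately ignores.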
So what DID we achieve 20 years ago? Like, what happened then that led to the claim of "human genome has been sequenced" and what is the additional progress that was made now?
What was announced 20 years ago was an incomplete assembly that met certain metrics that made sense at the time. https://www.nature.com/articles/35057062 is the paper. It describes the assembly as a "partial draft".

The section "Background to the Human Genome Project" gives some color on why they did what they did (TL;DR there was an ostensibly competitive race between the public project and a private one).

I ended up providing some useful resources for helping uncover just how bad genomic assemblies were (at the computational level): most genomic assemblies using whole genome shotgun sequencing used a number of heuristics which were believed to be correct, but I suspected that the heuristics failed to deal with repetitive regions and short sequences well. So I built a computing system with >1M Xeon cores (Google Exacycle) and we provided the system to Gene Myers (who did the original WGS assembly for Celera). He used the system to do an all-vs-all comparison of sequence pairs, which found numerous bugs and problems with the heuristics that were being used. It was a huge amount of compute, but the result was that Myers was able to use PacBio data to assemble a significantly better genome, faster, on a laptop: (https://www.yuzuki.org/favorite-talk-agbt-2014-gene-myers-ma...)

Ultimately, is there anything you could do with repetitive regions? Like, were the problems with the heuristics or just a mismatch between ~100 base reads and multi-kb repeats?
Most people have moved to techniques that produce longer reads with higher error rates. Good coverage plus longer reads overcomes the higher error rate. You can read the DALIGN paper by Myers to learn more. Or read https://dazzlerblog.wordpress.com/author/thegenemyers/
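A back-of-the-envelope way to see why coverage beats a high per-read error rate: assume (a big simplification) that each read is wrong at a given position independently with the same probability, and that the consensus is a simple majority vote. The function name and the 15% error rate are illustrative assumptions, not figures from any particular platform. The chance that more than half of the reads are wrong is then a binomial tail, which shrinks quickly as coverage grows.

```python
from math import comb

def p_majority_wrong(error_rate: float, coverage: int) -> float:
    """Probability that more than half of `coverage` independent reads
    are wrong at a single position (toy binomial model)."""
    k_min = coverage // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(coverage, k) * error_rate**k * (1 - error_rate)**(coverage - k)
        for k in range(k_min, coverage + 1)
    )

# Assumed per-read error rate of 15%, at increasing coverage.
for cov in (1, 10, 30):
    print(cov, p_majority_wrong(0.15, cov))
```

Real long-read errors are often insertions and deletions and are not fully independent, so this is only a rough intuition. It is also pessimistic in one respect: it counts any erroneous majority as a failure, even when the wrong reads disagree with each other.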
For a good read, I recommend The Gene: An Intimate History by Siddhartha Mukherjee
> Repetitive sequences yield lots of short reads that look almost identical, like a large expanse of blue sky in a puzzle, with no clues to how the pieces fit together or how many repeats there are

Ah, https://en.wikipedia.org/wiki/Clock_recovery !

Too bad DNA isn't a run-length limited code. (Wouldn't that be something.)

DNA is a code, but it's not just a code. It is also a really long molecule that has to bend and fold up in certain ways.

There are error detecting codes, in a way. Protein is encoded by 3-base codons, and if you insert or delete a number of bases that is not a multiple of 3, the reading frame shifts. Translation will then likely hit a premature stop codon, so the bad protein is truncated and the transcript is likely removed via nonsense-mediated decay.
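A small sketch of the frameshift effect, using the standard genetic code. The example sequence and the inserted bases are made up for illustration. Inserting one base shifts the reading frame and runs into an early stop codon, while inserting three bases only adds one amino acid.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str) -> str:
    """Translate from position 0, three bases at a time, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

original = "ATGGCTGAACGTTTAGGCTGA"                  # hypothetical coding sequence
frameshift = original[:3] + "T" + original[3:]      # insert 1 base: frame shifts
in_frame = original[:3] + "TTT" + original[3:]      # insert 3 bases: frame preserved

print(translate(original))    # MAERLG
print(translate(frameshift))  # MC
print(translate(in_frame))    # MFAERLG
```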

It's crazy to me that we still haven't hacked RNA polymerase or some other such "obvious" method to just linearly read all of the strand. The machinery is all there, by definition!
Most sequencing methods do use DNA polymerase to effectively copy the DNA while reading out the bases that are being incorporated. However, even when the genome is being replicated in vivo, the polymerase doesn't just copy the entire chromosome in one go from end to end, for various reasons. For example, for larger genomes such as those of animals, it would just take too long to copy it that way. But even for small genomes such as those of bacteria, DNA replication is still naturally done in fragments. One simple reason is that at any given time the polymerase can dissociate from the template and have to reattach. Even if you could engineer the polymerase to bind more tightly, you'd have to deal with the tradeoff between binding strength, replication rate, and error rate (i.e., a more tightly binding polymerase would likely copy more slowly).