From the article: "Repetitive DNA sequences are common throughout the genome and have always posed a challenge for sequencing because most technologies produce relatively short "reads" of the sequence, which then have to be pieced together like a jigsaw puzzle to assemble the genome. Repetitive sequences yield lots of short reads that look almost identical, like a large expanse of blue sky in a puzzle, with no clues to how the pieces fit together or how many repeats there are. . . . Filling in the remaining gaps in the human genome sequence opens up new regions of the genome where researchers can search for associations between sequence variations and disease and for other clues to important questions about human biology and evolution."
Or, to put it more simply: Finding the complete DNA sequence of chromosomes is difficult. That's because some parts of the sequence are highly repetitive. Using a new type of lab machine, the scientists were able to sequence the repetitive parts of the X chromosome. This gives a more complete picture of the X chromosome. And that can help scientists fight diseases and understand human biology better.
So 20 years ago when we "sequenced the human genome", we actually didn't? If you'd asked me whether or not we had a complete sequence of an X chromosome before I saw this I would have said, "Of course we have one, for over 20 years".
So much about the original announcements was overhyped PR. The original assembly was super-crappy and super-gappy. The folks running the two projects were exhausted and declared victory, then moved on.
Of all the fields that I've worked in, genomics has been one of the most overhyped (virtual drug discovery is the other) and it takes a ton of training just to understand how messed up the field is.
That's right -- sequenced genomes are typically assemblies of short fragments. The assembly algorithms fail in areas of low complexity or when large sequences are repeated an unknown number of times.
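To make the "repeated an unknown number of times" failure concrete, here is a minimal sketch in Python. The flanking sequences, the repeat unit, and the read length are all made up for illustration, and reads are idealized as error-free substrings. It also ignores coverage depth, which real assemblers can use as a rough hint about copy number.

```python
def short_reads(seq, k):
    """Return the set of all length-k substrings (idealized, error-free reads)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_genome(copies, unit="ACGT", left="TTGACCA", right="GGATCTT"):
    """Hypothetical locus: unique flank, tandem repeat array, unique flank."""
    return left + unit * copies + right


READ_LEN = 8  # much shorter than the repeat array

ten_copies = toy_genome(10)
twenty_five_copies = toy_genome(25)

# Both loci yield exactly the same set of distinct reads, so nothing in the
# read sequences themselves says how many copies of the unit there are.
same_reads = short_reads(ten_copies, READ_LEN) == short_reads(twenty_five_copies, READ_LEN)
print(len(ten_copies), len(twenty_five_copies), same_reads)
```

The two toy loci differ in length by 60 bases, yet any overlap-based assembly of these reads has no way to pick one over the other.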
So what DID we achieve 20 years ago? Like, what happened then that led to the claim of "human genome has been sequenced" and what is the additional progress that was made now?
What was announced 20 years ago was an incomplete assembly that met certain metrics that made sense at the time.
https://www.nature.com/articles/35057062 is the paper. It describes the assembly as a "partial draft".
The section "Background to the Human Genome Project" gives some color on why they did what they did (TL;DR there was an ostensibly competitive race between the public project and a private one).
I ended up providing some useful resources for helping uncover just how bad genomic assemblies were (at the computational level): most genomic assemblies using whole genome shotgun sequencing used a number of heuristics which were believed to be correct, but I suspected that the heuristics failed to deal with repetitive regions and short sequences well. So I built a computing system with >1M Xeon cores (Google Exacycle) and we provided the system to Gene Myers (who did the original WGS assembly for Celera). He used the system to do an all-vs-all comparison of sequence pairs, which found numerous bugs and problems with the heuristics that were being used. It was a huge amount of compute, but the result was that Myers was able to use PacBio data to assemble a significantly better genome, faster, on a laptop: (https://www.yuzuki.org/favorite-talk-agbt-2014-gene-myers-ma...)
Ultimately, is there anything you could do with repetitive regions? Like, were the problems with the heuristics or just a mismatch between ~100 base reads and multi-kb repeats?
> Repetitive sequences yield lots of short reads that look almost identical, like a large expanse of blue sky in a puzzle, with no clues to how the pieces fit together or how many repeats there are
DNA is a code, but it's not just a code. It is also a really long molecule that has to bend and fold up in certain ways.
There are error detecting codes, in a way. Protein is encoded by 3-base codons, and if you insert or delete a number of bases that isn't a multiple of 3, the reading frame will be misaligned. The shifted frame will then likely run into a stop codon before long, so the bad protein gets truncated and the faulty transcript is likely removed via nonsense-mediated decay.
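A rough sketch of that frameshift effect in Python. The "gene" below is an invented sequence, not a real one, and the code only looks for the first in-frame stop codon rather than translating to protein.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq):
    """Split a sequence into consecutive 3-base codons, dropping any leftover bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def first_stop(seq):
    """Return the index of the first in-frame stop codon, or None if there is none."""
    for index, codon in enumerate(codons(seq)):
        if codon in STOP_CODONS:
            return index
    return None


# Hypothetical coding sequence: ATG GCT GAC CTG AAG CTT GGA TAC TAA
gene = "ATGGCTGACCTGAAGCTTGGATACTAA"

one_base_deleted = gene[:3] + gene[4:]     # frameshift
three_bases_deleted = gene[:3] + gene[6:]  # stays in frame, loses one codon

print(first_stop(gene))                 # normal stop at the end
print(first_stop(one_base_deleted))     # premature stop
print(first_stop(three_bases_deleted))  # stop still at the end, one codon earlier
```

In this toy case the single-base deletion brings the first stop forward from codon 8 to codon 3, while the three-base deletion just shortens the product by one codon.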
It's crazy to me that we still haven't hacked RNA polymerase or some other such "obvious" method to just linearly read all of the strand. The machinery is all there, by definition!
Most sequencing methods do use DNA polymerase to effectively copy the DNA while reading out the bases that are being incorporated. However, even when the genome is being replicated in vivo, the polymerase doesn't just copy the entire chromosome in one go from end to end, for various reasons. For example, for larger genomes such as those of animals, it would just take too long to copy it that way. But even for small genomes such as bacteria, DNA replication is still naturally done in fragments. One simple reason is that at any given time the polymerase can dissociate from the template and have to reattach. Even if you could engineer the polymerase to bind more tightly, you'd have to deal with the tradeoff between binding strength, replication rate, and error rate (i.e., a more tightly binding polymerase would likely copy more slowly).
Imagine you have a bunch of aerial photographs that you're trying to assemble into one big mosaic of the entire region by matching up the overlaps on the edges. The problem is, this is a desert, and significant portions of the region are just flat expanses of empty sand, so all the aerial photographs from those regions look pretty much identical. Even worse, there are several of these flat sandy regions in the area, so you don't even know which region those photos came from.
So despite having taken multiple photos of every square inch of land in your target area, there's no way you can assemble them into one big image just by matching up the overlaps. Without a source of larger-scale information about the region, like a satellite photograph or GPS coordinates for the photos, you have no way of knowing how wide that desert is. All you know is that it's wider than one or two photographs.
This is essentially the same problem that current genome assemblies have: there are regions of repetitive sequence in the genome, so all the sequencing reads from those regions look identical to each other, just like the photographs of flat sandy desert, and there's no way to tell how they're supposed to overlap to form the full sequence. The only way to resolve these regions is with a technology that can read all the way through from one end to the other without stopping, producing a single contiguous sequence.
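A toy illustration of why reading all the way through helps, sketched in Python. The flanks, repeat unit, and reads are invented for the example, reads are assumed error-free, and `count_repeat_units` is a hypothetical helper, not part of any real assembler.

```python
import re


def count_repeat_units(read, left_flank, right_flank, unit):
    """Count tandem copies of `unit` in a read that spans both unique flanks.

    Returns None when the read does not contain flank + repeats + flank,
    i.e. when it cannot anchor the repeat array on both sides.
    """
    pattern = (
        re.escape(left_flank)
        + "((?:" + re.escape(unit) + ")+)"
        + re.escape(right_flank)
    )
    match = re.search(pattern, read)
    if match is None:
        return None
    return len(match.group(1)) // len(unit)


left, right, unit = "TTGACCA", "GGATCTT", "ACGT"
region = "CCGTA" + left + unit * 25 + right + "AGGCT"

long_read = region          # reads through the whole repeat array
short_read = region[20:40]  # lands entirely inside the repeats

print(count_repeat_units(long_read, left, right, unit))
print(count_repeat_units(short_read, left, right, unit))
```

The spanning read pins the array between two unique landmarks, so the copy count can be read off directly. The short read sees only sand.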
The link here describes the fruits of an effort using exactly those sorts of long-read technologies to fill in all the gaps in the X chromosome sequence, thus generating a single contiguous sequence from end to end, something that hasn't previously been possible for DNA sequences of this size.
As to why this is important, these repetitive sequences, despite being apparently featureless, still sometimes have important effects (not unlike the apparently dead and featureless desert in the analogy). In addition, sometimes there are "oases" of functionally important non-repetitive DNA sequence within the "desert" of repetition, and previous genome assembly methods would not be able to tell where these oases belonged. All of this is important because many functional DNA elements are cis-acting. That is, they exert effects on genes that are nearby on the genome. So if you don't know where they belong, then you don't know what they're doing.
If you can assemble one big chromosome sequence from end to end, all of the above problems go away, and you can finally get on with the analysis you wanted to do anyway and stop worrying about not being able to calculate meaningful distances between DNA elements.