


	by andreygrehov 2167 days ago
	In simple English, could someone explain why this is good and what it all means?

3 comments

mr_overalls 2167 days ago

From the article: "Repetitive DNA sequences are common throughout the genome and have always posed a challenge for sequencing because most technologies produce relatively short "reads" of the sequence, which then have to be pieced together like a jigsaw puzzle to assemble the genome. Repetitive sequences yield lots of short reads that look almost identical, like a large expanse of blue sky in a puzzle, with no clues to how the pieces fit together or how many repeats there are. . . . Filling in the remaining gaps in the human genome sequence opens up new regions of the genome where researchers can search for associations between sequence variations and disease and for other clues to important questions about human biology and evolution."

Or, to make this simpler: Finding the complete DNA sequence of chromosomes is difficult. That's because some parts of the sequence are highly repetitive. Using a new type of lab machine, the scientists were able to sequence the repetitive parts of the X chromosome. This gives a more complete picture of the X chromosome. And that can help scientists fight diseases and understand human biology better.

cco 2167 days ago

So 20 years ago when we "sequenced the human genome", we actually didn't? If you'd asked me whether or not we had a complete sequence of an X chromosome before I saw this I would have said, "Of course we have one, for over 20 years".

dekhn 2166 days ago

So much about the original announcements was overhyped PR. The original assembly was super-crappy and super-gappy. The folks running the two projects were exhausted and declared victory, then moved on.

Of all the fields that I've worked in, genomics has been one of the most overhyped (virtual drug discovery is the other) and it takes a ton of training just to understand how messed up the field is.

cco 2166 days ago

Wow, this is news to me. What a farce haha. Thanks for adding clarity here and breaking a bad assumption I had!

iskander 2167 days ago

That's right -- sequenced genomes are typically assemblies of short fragments. The assembly algorithms fail in areas of low complexity or when large sequences are repeated an unknown number of times.
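
A toy sketch of why "repeated an unknown number of times" is the hard part. It assumes idealized, error-free reads and ignores coverage depth (how often each read is seen), which real assemblers do use as a hint. The flanks, the repeat unit, and the helper name `reads_of` are made up for illustration. Two genomes that differ only in repeat count produce exactly the same set of distinct short reads:

```python
def reads_of(genome, read_len):
    """Set of distinct substrings of length read_len (idealized, error-free reads)."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}

# Made-up sequences, for illustration only.
LEFT, RIGHT = "ACGTTGCA", "GGATCCAT"   # unique flanking sequence
UNIT = "CAG"                           # repeat unit

genome_a = LEFT + UNIT * 10 + RIGHT    # 10 copies of the repeat
genome_b = LEFT + UNIT * 25 + RIGHT    # 25 copies of the repeat

short_a = reads_of(genome_a, 6)
short_b = reads_of(genome_b, 6)

# Different genomes, identical sets of short reads: the reads alone
# cannot say how many copies of the repeat there are.
print(genome_a != genome_b, short_a == short_b)
```

Every 6-base read either touches a flank (and then sees at most 5 bases of repeat) or lies entirely inside the repeat, so the copy number never shows up in any single read.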

virgilp 2167 days ago

So what DID we achieve 20 years ago? Like, what happened then that led to the claim of "human genome has been sequenced" and what is the additional progress that was made now?

dekhn 2166 days ago

What was announced 20 years ago was an incomplete assembly that met certain metrics that made sense at the time. https://www.nature.com/articles/35057062 is the paper. It describes the assembly as a "partial draft".

The section "Background to the Human Genome Project" gives some color on why they did what they did (TL;DR there was an ostensibly competitive race between the public project and a private one).

I ended up providing some useful resources for helping uncover just how bad genomic assemblies were (at the computational level): most genomic assemblies using whole genome shotgun (WGS) sequencing used a number of heuristics which were believed to be correct, but I suspected that the heuristics failed to deal with repetitive regions and short sequences well. So I built a computing system with >1M Xeon cores (Google Exacycle) and we provided the system to Gene Myers (who did the original WGS assembly for Celera). He used the system to do an all-vs-all comparison of sequence pairs, which found numerous bugs and problems with the heuristics that were being used. It was a huge amount of compute, but the result was that Myers was able to use PacBio data to assemble a significantly better genome, faster, on a laptop: (https://www.yuzuki.org/favorite-talk-agbt-2014-gene-myers-ma...)
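
For a rough sense of what "all-vs-all comparison of sequence pairs" means, here is a brute-force toy. It is not the Exacycle pipeline or Myers' actual method; the function names, the reads, and the `min_len` threshold are all assumptions made for illustration. It checks every ordered pair of reads for a suffix-to-prefix overlap, which is quadratic in the number of reads and is why real assemblers lean on heuristics to skip most pairs:

```python
from itertools import permutations

def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def all_vs_all(reads, min_len=3):
    """Compare every ordered pair of reads (O(n^2) pairs) and keep the overlapping ones."""
    found = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n:
            found[(a, b)] = n
    return found

# Made-up reads for illustration.
reads = ["ACGTAC", "TACGGA", "GGATTT"]
print(all_vs_all(reads))
```

With these three reads, the exhaustive comparison finds the two overlaps that chain them into ACGTACGGATTT. An exhaustive pass like this can serve as ground truth for checking whether a faster heuristic misses or invents overlaps.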

iskander 2166 days ago

Ultimately, is there anything you could do with repetitive regions? Like, were the problems with the heuristics or just a mismatch between ~100 base reads and multi-kb repeats?

technobabble 2166 days ago

For a good read, I recommend The Gene: An Intimate History by Siddhartha Mukherjee

derefr 2167 days ago

> Repetitive sequences yield lots of short reads that look almost identical, like a large expanse of blue sky in a puzzle, with no clues to how the pieces fit together or how many repeats there are

Ah, https://en.wikipedia.org/wiki/Clock_recovery !

Too bad DNA isn't a run-length limited code. (Wouldn't that be something.)

haihaibye 2166 days ago

DNA is a code, but it's not just a code. It is also a really long molecule that has to bend and fold up in certain ways.

There are error-detecting codes, in a way. Protein is encoded by 3-base codes (codons), and if you insert or delete a number of bases that is not a multiple of 3, the reading frame will be misaligned; it will then eventually likely encode a stop codon, causing the bad protein to be truncated and likely removed via nonsense-mediated decay.
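
A small sketch of that frameshift effect, using the standard genetic code. The DNA sequence is made up, and `translate` is a hypothetical helper that reads codons from position 0 and stops at the first stop codon; nonsense-mediated decay itself is not modeled.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate from position 0, three bases at a time, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

original = "ATGGATAACCGTTTGGCATAA"                # made-up gene: ATG GAT AAC CGT TTG GCA TAA
frameshift = original[:3] + "C" + original[3:]    # insert 1 base: frame shifts
in_frame = original[:3] + "CCC" + original[3:]    # insert 3 bases: frame preserved

print(translate(original))    # MDNRLA
print(translate(frameshift))  # MR
print(translate(in_frame))    # MPDNRLA
```

The one-base insertion moves a TAA that was harmlessly split across two codons into frame, so translation stops after two residues. The three-base insertion only adds one amino acid.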

hoseja 2167 days ago

It's crazy to me that we still haven't hacked RNA polymerase or some other such "obvious" method to just linearly read all of the strand. The machinery is all there, by definition!

jdale27 2167 days ago

Most sequencing methods do use DNA polymerase to effectively copy the DNA while reading out the bases that are being incorporated. However, even when the genome is being replicated in vivo, the polymerase doesn't just copy the entire chromosome in one go from end to end, for various reasons. For example, for larger genomes such as those of animals, it would just take too long to copy it that way. But even for small genomes such as those of bacteria, DNA replication is still naturally done in fragments. One simple reason is that at any given time the polymerase can dissociate from the template and have to reattach. Even if you could engineer the polymerase to bind more tightly, you'd have to deal with the tradeoff between binding strength, replication rate, and error rate (i.e., a more tightly binding polymerase would likely copy more slowly).

rcthompson 2167 days ago

Imagine you have a bunch of aerial photographs that you're trying to assemble into one big mosaic of the entire region by matching up the overlaps on the edges. The problem is, this is a desert, and significant portions of the region are just flat expanses of empty sand, so all the aerial photographs from those regions look pretty much identical. Even worse, there are several of these flat sandy regions in the area, so you don't even know which region those photos came from.

So despite having taken multiple photos of every square inch of land in your target area, there's no way you can assemble them into one big image just by matching up the overlaps. Without a source of larger-scale information about the region, like a satellite photograph or GPS coordinates for the photos, you have no way of knowing how wide that desert is. All you know is that it's wider than one or two photographs.

This is essentially the same problem that current genome assemblies have: there are regions of repetitive sequence in the genome, so all the sequencing reads from those regions look identical to each other, just like the photographs of flat sandy desert, and there's no way to tell how they're supposed to overlap to form the full sequence. The only way to resolve these regions is with a technology that can read all the way through from one end to the other without stopping, producing a single contiguous sequence.
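
To make the "read all the way through" point concrete, here is a minimal sketch assuming an error-free read and known unique flanks on both sides of the repeat. The sequences and the helper `copies_spanned` are invented for illustration; real long reads have errors and need alignment rather than exact matching. A read that contains both flanks pins down the repeat count, while a read that sits inside the repeat cannot:

```python
def copies_spanned(read, left, right, unit):
    """Copy count of unit if read contains left flank + whole copies of unit + right flank, else None."""
    start = read.find(left)
    end = read.rfind(right)
    if start == -1 or end == -1 or end < start + len(left):
        return None                      # read does not span both flanks
    middle = read[start + len(left):end]
    count, rem = divmod(len(middle), len(unit))
    if rem or middle != unit * count:
        return None                      # region between flanks is not a clean repeat
    return count

# Made-up sequences, for illustration only.
LEFT, RIGHT, UNIT = "ACGTTGCA", "GGATCCAT", "CAG"
genome = "TTTT" + LEFT + UNIT * 25 + RIGHT + "AAAA"

long_read = genome[2:-2]     # spans the whole repeat plus both flanks
short_read = genome[20:26]   # lies entirely inside the repeat

print(copies_spanned(long_read, LEFT, RIGHT, UNIT))   # 25
print(copies_spanned(short_read, LEFT, RIGHT, UNIT))  # None
```

In the photo analogy, the long read is the satellite image: it has recognizable landmarks on both edges, so the width of the desert in between can simply be measured.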

The link here describes the fruits of an effort using exactly those sorts of long-read technologies to fill in all the gaps in the X chromosome sequence, thus generating a single contiguous sequence from end to end, something that hasn't previously been possible for DNA sequences of this size.

As to why this is important, these repetitive sequences, despite being apparently featureless, still sometimes have important effects (not unlike the apparently dead and featureless desert in the analogy). In addition, sometimes there are "oases" of functionally important non-repetitive DNA sequence within the "desert" of repetition, and previous genome assembly methods would not be able to tell where these oases belonged. All of this is important because many functional DNA elements are cis-acting. That is, they exert effects on genes that are nearby on the genome. So if you don't know where they belong, then you don't know what they're doing.

If you can assemble one big chromosome sequence from end to end, all of the above problems go away, and you can finally get on with the analysis you wanted to do anyway and stop worrying about not being able to calculate meaningful distances between DNA elements.

avancemos 2167 days ago

Having a complete and accurate map is important for researchers who study the X-chromosome.

Long sequence reads allowed them to map the highly repetitive chromosome. Most sequencing is done by high-throughput short reads.

The technology can theoretically be used to map other regions of the genome which are highly repetitive.