by rcthompson 2162 days ago
Imagine you have a bunch of aerial photographs that you're trying to assemble into one big mosaic of the entire region by matching up the overlaps on the edges. The problem is, this is a desert, and significant portions of the region are just flat expanses of empty sand, so all the aerial photographs from those regions look pretty much identical. Even worse, there are several of these flat sandy regions in the area, so you don't even know which region those photos came from.

So despite having taken multiple photos of every square inch of land in your target area, there's no way you can assemble them into one big image just by matching up the overlaps. Without a source of larger-scale information about the region, like a satellite photograph or GPS coordinates for the photos, you have no way of knowing how wide that desert is. All you know is that it's wider than one or two photographs.

This is essentially the same problem that current genome assemblies have: there are regions of repetitive sequence in the genome, so all the sequencing reads from those regions look identical to each other, just like the photographs of flat sandy desert, and there's no way to tell how they're supposed to overlap to form the full sequence. The only way to resolve these regions is with a technology that can read all the way through from one end to the other without stopping, producing a single contiguous sequence.
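
To make the ambiguity concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the sequences, the `reads` helper, and the read lengths are assumptions, and the reads are idealized (error-free, every position covered, read depth ignored). It only shows that two "genomes" differing in repeat length yield the same set of short reads, while reads long enough to span the repeat tell them apart.

```python
def reads(genome, read_len):
    """Set of every substring of length read_len (idealized, error-free reads).

    Using a set deliberately ignores how often each read is seen; real
    assemblers can use read depth as a rough hint, which this toy skips.
    """
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


# Hypothetical toy sequences: unique flanks around a two-base repeat.
left, right = "GGTTGT", "TGGATT"
genome_a = left + "AC" * 10 + right  # repeat is 20 bases long
genome_b = left + "AC" * 12 + right  # repeat is 24 bases long

SHORT = 8   # shorter than the repeat: reads get "lost in the desert"
LONG = 30   # long enough to span either repeat plus some flank on each side

# Short reads cannot distinguish the two genomes.
print(reads(genome_a, SHORT) == reads(genome_b, SHORT))

# Long reads include the whole repeat with flanks attached, so they differ.
print(reads(genome_a, LONG) == reads(genome_b, LONG))
```

With the short reads, nothing in the data says whether the repeat is 20 or 24 bases long, which is the "how wide is the desert" question from the analogy.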

The link here describes the fruits of an effort using exactly those sorts of long-read technologies to fill in all the gaps in the X chromosome sequence, thus generating a single contiguous sequence from end to end, something that hasn't previously been possible for DNA sequences of this size.

As to why this is important, these repetitive sequences, despite being apparently featureless, still sometimes have important effects (not unlike the apparently dead and featureless desert in the analogy). In addition, sometimes there are "oases" of functionally important non-repetitive DNA sequence within the "desert" of repetition, and previous genome assembly methods could not tell where these oases belonged. That matters because many functional DNA elements are cis-acting. That is, they exert effects on genes that are nearby on the genome. So if you don't know where they belong, then you don't know what they're doing.

If you can assemble one big chromosome sequence from end to end, all of the above problems go away, and you can finally get on with the analysis you wanted to do anyway and stop worrying about not being able to calculate meaningful distances between DNA elements.