

by xyhopguy 2050 days ago

They're not really similar problems. Genome assembly is most effectively solved in the lab and with traditional graph algorithms. Protein folding fits very nicely into structured prediction and has a very quantitative way of measuring performance. Genome assembly is more qualitative, to say the least.

Given a bunch of reads of short (50-300 bp) or long (1,000-100,000 bp) lengths (we use different algorithms for each), you need to resolve a 1D sequence for each chromosome that contains all of those contiguous subsequences. Genome assembly is often described as finding a Hamiltonian cycle among the data. For short reads, we use de Bruijn graphs to avoid the hardness of the full Hamiltonian problem. For long reads, MinHash has become a popular heuristic for finding which reads overlap.
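To make the de Bruijn idea concrete, here is a minimal Python sketch, not how any production assembler works. It assumes error-free, single-stranded reads and a toy genome with no repeated (k-1)-mers; the function names `de_bruijn`, `eulerian_path`, and `assemble` are made up for illustration. Each k-mer becomes an edge between its (k-1)-mer prefix and suffix, so the sequence falls out of a walk that uses every edge once (an Eulerian path, found here with Hierholzer's algorithm) rather than a Hamiltonian one.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it, one edge per distinct k-mer."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:  # ignore k-mers already seen in other reads
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: visit every edge exactly once."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_deg[node] += 1
    # Start where out-degree exceeds in-degree by one, else anywhere.
    start = next(
        (n for n, succ in graph.items() if len(succ) - in_deg[n] == 1),
        next(iter(graph)),
    )
    edges = {n: list(succ) for n, succ in graph.items()}  # consumable copy
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy input: three overlapping 6 bp reads, given out of order.
reads = ["CGTGCA", "ATGGCG", "GGCGTG"]
print(assemble(reads, k=4))
```

Real data breaks every assumption above: sequencing errors, both strands, uneven coverage, and above all repeats, which is where the rest of this comment comes in.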

But this can be tricky. For instance, in a species that isn't haploid, you actually have two possible genomes that you need to assemble correctly. Sometimes the difference is a base or two, but sometimes it's much longer stretches that can appear as large 'bubbles' in the assembly graph.
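A rough Python sketch of what a 'bubble' looks like, under toy assumptions: two made-up haplotypes that differ by a single base, error-free sequence, and a graph built from (k-1)-mers. The helper names `de_bruijn` and `find_bubbles` are hypothetical, and the detection only handles the simplest case (one fork, two unbranched arms that rejoin).

```python
from collections import defaultdict


def de_bruijn(seqs, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for s in seqs:
        for i in range(len(s) - k + 1):
            graph[s[i:i + k - 1]].add(s[i + 1:i + k])
    return graph


def find_bubbles(graph, max_len=100):
    """Find simple bubbles: a node with two successors whose unbranched
    paths meet again. Returns (start, end, arm_a, arm_b) tuples."""
    bubbles = []
    for start, successors in graph.items():
        if len(successors) != 2:
            continue
        paths = []
        for node in sorted(successors):
            path = [node]
            # Follow the arm while it has exactly one way forward.
            while len(path) < max_len and len(graph.get(path[-1], ())) == 1:
                path.append(next(iter(graph[path[-1]])))
            paths.append(path)
        a, b = paths
        in_b = set(b)
        end = next((n for n in a if n in in_b), None)  # first shared node
        if end is not None:
            bubbles.append((start, end, a[:a.index(end)], b[:b.index(end)]))
    return bubbles


# Two toy haplotypes differing at one position (C vs A).
hap_a = "ACGTTGCATGGA"
hap_b = "ACGTTGAATGGA"
for start, end, arm_a, arm_b in find_bubbles(de_bruijn([hap_a, hap_b], k=4)):
    print(start, "->", arm_a, "|", arm_b, "->", end)
```

A single-base difference gives a short bubble like this one; longer structural differences give longer arms, and deciding which arm (or both) to keep is the heuristic part.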

The assembly problem can be harder or easier depending on the organism. For instance, wheat is notoriously hard to assemble because it has 3 mostly (but not completely) distinct genomes. The Norway spruce genome is ~20B base pairs, which will put strain on even the biggest machines. Oh, and it's mostly repeats.

Repeats are _everywhere_, and they can be really long (sometimes on the order of millions of nucleotides). Also, depending on the species you're assembling, the repeat content and the kind of repeats can change. You can also get different errors from the library preparation, again related to the species of origin.
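Why repeats hurt can be shown with a tiny Python example. This is a sketch with invented sequences: two different toy genomes share a 3 bp repeat three times, with the stretches between the copies swapped. The names `kmer_counts` and `distinguishable` are hypothetical. If the k-mers are too short to reach across the repeat, both genomes produce exactly the same k-mers, so no algorithm can tell them apart from that data alone.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def distinguishable(a, b, k):
    """True if the two sequences have different k-mer content."""
    return kmer_counts(a, k) != kmer_counts(b, k)


REPEAT = "ACG"
# Same pieces, but the two segments between repeat copies are swapped.
genome_a = "TT" + REPEAT + "CC" + REPEAT + "GA" + REPEAT + "TA"
genome_b = "TT" + REPEAT + "GA" + REPEAT + "CC" + REPEAT + "TA"

for k in (4, 5):
    print(k, distinguishable(genome_a, genome_b, k))
```

With k=4 a k-mer can only touch the repeat plus one flank, so the two genomes look identical; with k=5 a k-mer spans the repeat and both flanks, and they separate. Scale the repeat up to millions of nucleotides and you can see why read length and lab-derived anchors matter so much.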

Using lab techniques is particularly helpful, as we can leverage molecular and genetic information. BACs (bacterial artificial chromosomes) can be used to break the genome into smaller contiguous chunks and to act as 'anchors' for the larger assembly problem. Recombination rates of different SNPs can be used to infer spatially local segments of DNA, and optical tags can be added and imaged to provide physical anchors for certain sequences. Some organisms can be bred into pure lines with limited heterozygosity. Expressed RNA transcripts can be used in a similar manner, as they are the concatenation of ordered exons. Combining these methods is typically the best way to get a good genome assembly.

Some of the bubble resolution heuristics could probably be improved with deep learning, but getting the right data for that is probably more effective with traditional graph algorithms. Also, you really only need a good genome assembly once, and sometimes it doesn't even need to be all that good.

Genome assembly is a _really_ fascinating area of research, and honestly, the right direction is probably to represent genomes as something other than a 1D sequence, something that more accurately reflects biology. Deep learning is still looking to make its mark on genomics, and unfortunately it's not well suited for much of the discovery effort, such as genome assembly.

2 comments

jakobnissen 2048 days ago

What a great write-up. I used to work in metagenomics, where we would assemble environments with hundreds of bacterial genomes. We developed a deep-learning approach to binning the assembly, Vamb, which worked really well. In hindsight, our approach to deep learning was quite naive, so I can easily see a more skilled application of DL completely outclassing all existing approaches to binning. So while DL would indeed have a hard time with assembly, I'm not sure that's the case for genomics as such.


acmj 2050 days ago

No practical assemblers take assembly as a Hamiltonian problem.


xyhopguy 2050 days ago

Many long-read assemblers (good ones, at that) treat it as a Hamiltonian problem.


acmj 2050 days ago

No, not even a single one of them. Read Gene Myers' papers from 1995 or 2005. Modern OLC assemblers all follow that route, which has nothing to do with the Hamiltonian problem. Equating overlap-based assembly to a Hamiltonian problem is the biggest lie in the field of sequence assembly. Please stop spreading that.
