Hacker News new | ask | show | jobs
by sampo 4889 days ago
Genome assembly relies heavily on sequence alignment. So: Is genome assembly hard just because sequence alignment is hard? Or would genome assembly present separate algorithmic problems even if there was a super-efficient solution to sequence alignment?
3 comments

It is far more difficult than sequence alignment. Sequence alignment has quadratic complexity, while fragment assembly is NP-hard. Se for example

http://scholar.google.com/scholar?cluster=131745416915434219...

Yes, for pairwise sequence alignment. The globally optimized multiple sequence alignment problem is NP-complete.
These are different sorts of alignments, with different sorts of math behind them.

Genome assembly is the shortest common super sequence problem. It involves finding the best rearrangement and overlap of reads which minimize the overall sequence, given the expected errors in the read technology. It would still be hard even if all of the reads were perfect.

Sequence alignment looks at two or more sequences in their entirety, and does a best fit alignment using a given model of how substitutions and gaps can occur. This model may be based on chemical or evolutionary knowledge.

A "super-efficient solution to sequence alignment" doesn't lead to a way to tell how the reads should be assembled into a single large sequence, even ignoring possible read errors.

An extra difficulty with genome assembly is that DNA often has lots and lots of repeated junk sequences that can confuse the algorithms. I don't work with bioinformatics to know how they usually get around this though.
Repeats aren't necessarily junk (e.g. TAL Effectors http://en.wikipedia.org/wiki/TAL_effector#DNA_recognition). Resolving them requires long reads. PacBio is currently of interest as an alternative to Sanger sequencing for this, although the error rate of PacBio reads is a bit of an issue.
pacbio is dead, they just don't know it yet. BGI (or somebody, doesn't matter, BGI is just the obvious candidate) would need to buy 50 SMART sequencers a year just for PacBio to stay in business. That seems unlikely given the lower cost and complexity of Illumina and Life sequencers