|
|
|
|
|
by epistasis
5607 days ago
|
|
You have just been shown an amazing invention: the BWT. Rather than dismiss it and the entire field of performance obsessed people that find it more useful than any hash based algorithm, perhaps you should study it. The speculation in your comment is laughable, from a cluster costing double digit dollar difference over a single machine, to sequence data being used in forensic DNA. Sure, you throw around some introductory jargon that may fool an outsider into thinking you know what you're talking about, and that's why I'm bothering to comment so that no one is deceived. BWT =~ suffix tree with far far greater memory efficiency |
|
Essentially, any alignment technique that can come close enough to I/O saturating reading both the dominant factor (the reads) and the minor factor (the reference) is optimal.
When (for a slightly different kind of gapped, error-prone read) I wrote alignment software, I also beat MAQ by 10-20X, like the BWA algorithm but without using anything like the BWT hair. I contemplated how the gapping rules and error rules looked in a brute force NFA translation. I found simple ways to optimize that NFA very cheaply. I found that it paid to maximize read-data throughput by devoting all of memory to the NFA built from as many reads as possible. It was not hard, this way, to get in the ballpark (small enough constant factor) of I/O bounding read processing (where "not hard" means something like 3-6 months of staring at walls and scribbling stuff on paper until I found some solutions that were simple enough it is only bad luck I didn't find them right off the bat). Building a prefix or suffix tree or any other kind of index to the reference was the approach my boss suggested at the start and that I explicitly set out to beat (and did!). Streaming the reference is not the bottleneck if you can do it in an I/O bound way because (for resequencing, at least), the size of the read data is much larger and you can do alignment while coming within a small constant of I/O saturating the input of that read data using a simpler NFA approach.
Looking at the description of MAQ, besting its performance by any number of means is not so surprising. Reaching for the hair of using BWT does surprise me.