Hacker News new | ask | show | jobs
by dasht 5604 days ago
Given a (not even very large) cluster, we (with the fairly brute force-ish but cleverly tuned NFA) can populate hash tables for the read-based NFA of your 2 billion needles just about as fast as you can transmit the data and give you optimal alignment (by flexible criteria) just about as fast as you can then stream through the 3 billion base pairs. We are down to, at most, single or double digit dollar differences either way, but mine is vastly simpler and easier to tweak. Probably, we can do my way cheaper as the commodity computing market improves.

If you want to tell me that BWT is more useful for something like searching an FBI or CIA DNA database, and that intel types want to encourage commercial development of BWT (even if irrational for the customers) to subsidize its covert use in intel -- that I can believe. I can see where it would be helpful not for resequencing (a way to read off interesting details of an individuals genotype based on a lot of reads) but rather for fingerprinting of suspects and surveillance targets against a large database of reference genomes.

2 comments

Burroughs-Wheeler transform is not just an academic idea, but is the basis for one of the standard tools for aligning short reads: http://bowtie-bio.sourceforge.net/index.shtml . There was a mention of compression earlier in the thread, but this is a red herring: BWT also used in bzip2's compression, but the use with DNA sequences is not to compress, but a pre-processing step on the haystack. As epistasis said, when you have billions needles, it makes sense to pre-process the haystack so that the per-needle cost is as low as possible.

As for the FBI's DNA database, this does not consist of sequence data, but, I believe, just microsatellite data. Even if they had full sequence data, it wouldn't make sense to search for new samples against the whole database, but to align the sample with a reference genome and then match the variations across a database of variations.

I get that its the standard but see my comment above. If you can come close to I/O saturating inputing read data at only the cost of streaming the reference near I/O saturation, that's optimal -- and I don't think you need the hair of BWT to do that (and the BWT memory cost might even hurt you).

Its different for an imagined database where you are trying to align reads against a lot of personal genomes, e.g., to identify someone given a pretty complete genomic database of the usual suspects plus a few reads from the crime scene -- more around-the-corner "GATTACA" scenario than today's FBI, perhaps. There, perhaps, its economic to dedicate one server to ever N "usual suspects", loading the server with BWT of the suspect database, then stream reads out to the servers rather than loading servers with set of reads and streaming a single-human reference over that. (Of course, maybe you could just align the crime scene reads against a single human reference and then .... So even in GATTACA land use of BWT seems like dubious hair against a lightweight opponent like MAQ)

You have just been shown an amazing invention: the BWT. Rather than dismiss it and the entire field of performance obsessed people that find it more useful than any hash based algorithm, perhaps you should study it. The speculation in your comment is laughable, from a cluster costing double digit dollar difference over a single machine, to sequence data being used in forensic DNA. Sure, you throw around some introductory jargon that may fool an outsider into thinking you know what you're talking about, and that's why I'm bothering to comment so that no one is deceived.

BWT =~ suffix tree with far far greater memory efficiency

Please relax a bit, geeze! For resequencing, you have to input all of the reads and this should be a significantly larger number of bytes than the number of bytes than the number of bytes in the unindexed reference. So let's look at the I/O costs of inputing the read data.

Essentially, any alignment technique that can come close enough to I/O saturating reading both the dominant factor (the reads) and the minor factor (the reference) is optimal.

When (for a slightly different kind of gapped, error-prone read) I wrote alignment software, I also beat MAQ by 10-20X, like the BWA algorithm but without using anything like the BWT hair. I contemplated how the gapping rules and error rules looked in a brute force NFA translation. I found simple ways to optimize that NFA very cheaply. I found that it paid to maximize read-data throughput by devoting all of memory to the NFA built from as many reads as possible. It was not hard, this way, to get in the ballpark (small enough constant factor) of I/O bounding read processing (where "not hard" means something like 3-6 months of staring at walls and scribbling stuff on paper until I found some solutions that were simple enough it is only bad luck I didn't find them right off the bat). Building a prefix or suffix tree or any other kind of index to the reference was the approach my boss suggested at the start and that I explicitly set out to beat (and did!). Streaming the reference is not the bottleneck if you can do it in an I/O bound way because (for resequencing, at least), the size of the read data is much larger and you can do alignment while coming within a small constant of I/O saturating the input of that read data using a simpler NFA approach.

Looking at the description of MAQ, besting its performance by any number of means is not so surprising. Reaching for the hair of using BWT does surprise me.