Hacker News new | ask | show | jobs
by atjj 5605 days ago
Burroughs-Wheeler transform is not just an academic idea, but is the basis for one of the standard tools for aligning short reads: http://bowtie-bio.sourceforge.net/index.shtml . There was a mention of compression earlier in the thread, but this is a red herring: BWT also used in bzip2's compression, but the use with DNA sequences is not to compress, but a pre-processing step on the haystack. As epistasis said, when you have billions needles, it makes sense to pre-process the haystack so that the per-needle cost is as low as possible.

As for the FBI's DNA database, this does not consist of sequence data, but, I believe, just microsatellite data. Even if they had full sequence data, it wouldn't make sense to search for new samples against the whole database, but to align the sample with a reference genome and then match the variations across a database of variations.

1 comments

I get that its the standard but see my comment above. If you can come close to I/O saturating inputing read data at only the cost of streaming the reference near I/O saturation, that's optimal -- and I don't think you need the hair of BWT to do that (and the BWT memory cost might even hurt you).

Its different for an imagined database where you are trying to align reads against a lot of personal genomes, e.g., to identify someone given a pretty complete genomic database of the usual suspects plus a few reads from the crime scene -- more around-the-corner "GATTACA" scenario than today's FBI, perhaps. There, perhaps, its economic to dedicate one server to ever N "usual suspects", loading the server with BWT of the suspect database, then stream reads out to the servers rather than loading servers with set of reads and streaming a single-human reference over that. (Of course, maybe you could just align the crime scene reads against a single human reference and then .... So even in GATTACA land use of BWT seems like dubious hair against a lightweight opponent like MAQ)