|
|
|
|
|
by atjj
5605 days ago
|
|
Burroughs-Wheeler transform is not just an academic idea, but is the basis for one of the standard tools for aligning short reads: http://bowtie-bio.sourceforge.net/index.shtml . There was a mention of compression earlier in the thread, but this is a red herring: BWT also used in bzip2's compression, but the use with DNA sequences is not to compress, but a pre-processing step on the haystack. As epistasis said, when you have billions needles, it makes sense to pre-process the haystack so that the per-needle cost is as low as possible. As for the FBI's DNA database, this does not consist of sequence data, but, I believe, just microsatellite data. Even if they had full sequence data, it wouldn't make sense to search for new samples against the whole database, but to align the sample with a reference genome and then match the variations across a database of variations. |
|
Its different for an imagined database where you are trying to align reads against a lot of personal genomes, e.g., to identify someone given a pretty complete genomic database of the usual suspects plus a few reads from the crime scene -- more around-the-corner "GATTACA" scenario than today's FBI, perhaps. There, perhaps, its economic to dedicate one server to ever N "usual suspects", loading the server with BWT of the suspect database, then stream reads out to the servers rather than loading servers with set of reads and streaming a single-human reference over that. (Of course, maybe you could just align the crime scene reads against a single human reference and then .... So even in GATTACA land use of BWT seems like dubious hair against a lightweight opponent like MAQ)