| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by dasht 5605 days ago

Interesting.

Note that the worst case of complexity for this algorithm is much, much worse than the worst case complexity for Boyer Moore. Do not use this algorithm carelessly. For example, if you use it in a thoughtless way in your web server, you may open yourself to a DoS attack.

Note that the author nicely characterizes it as of potential use for small alphabets and possibly multiple substrings (in a single search). That immediately made me think he might have devised it for genomics research. In most applications I would think you'd also want regexp features. Interestingly, DNA research and use in a regexp engine is something he goes on to suggest. (If you are searching for a very large number of regexps in a big genome database, I would not use this algorithm. I found that some simple variants on classic NFA techniques work very well for a wide class of typical regexps (e.g., regexps modeling SNPs, small read position errors, small numbers of read errors, etc. There probably isn't any one obviously right answer, though, and a lot depends on your particular hardware situation, data set sizes, etc.).

The HN headline is very bogus hype. "X2 times faster than Boyer-Moore" is far from true in the general case. "breakthrough" is a gross exaggeration: this is a technique that anyone with some good algorithms course or two under the belt should be able to think of an, for most applications, decide to not use because of the limitations of the thing. I can definitely see it being nice for some applications tolerant of its limitations but... breakthrough it ain't.

2 comments

thisisnotmyname 5605 days ago

For sequence alignment, the state of the art is BWA, which first compresses the "haystack", then builds a trie.

See http://en.wikipedia.org/wiki/Burrows–Wheeler_transform or http://bioinformatics.oxfordjournals.org/content/early/2009/...

link

dasht 5605 days ago

Thanks. That's interesting. I'm pretty confident that you don't really need to compress the reference that way. It doesn't make a lot of intuitive sense that you would, in a way: streaming over the reference can be very fast and the question is how many reads you can align per pass, how flexibly, and with how low a pre-processing cost. I think my stuff (which doesn't count since we didn't get to publication stage for other reasons, etc.) was probably faster and more flexible.

link

epistasis 5604 days ago

You'd be prerty confident in the other direction once you understand the problem better. The BWT is not compression as much as an extremely clever way of rearranging the haystack. There are many many different string alignment (not string search) algorithms that are useful with DNA, and where the BWT is used your algorithm is not going to be in the realm of useful. BWT based aligners run in time completely independent of the haystack length. When you have 2 billion needles of size 50-200, and the haystack is 3 billion long, it makes a ton of sense to pay the preprocessing cost of O(n lg n), since it only has to be done on the order of once a year.

link

dasht 5604 days ago

Given a (not even very large) cluster, we (with the fairly brute force-ish but cleverly tuned NFA) can populate hash tables for the read-based NFA of your 2 billion needles just about as fast as you can transmit the data and give you optimal alignment (by flexible criteria) just about as fast as you can then stream through the 3 billion base pairs. We are down to, at most, single or double digit dollar differences either way, but mine is vastly simpler and easier to tweak. Probably, we can do my way cheaper as the commodity computing market improves.

If you want to tell me that BWT is more useful for something like searching an FBI or CIA DNA database, and that intel types want to encourage commercial development of BWT (even if irrational for the customers) to subsidize its covert use in intel -- that I can believe. I can see where it would be helpful not for resequencing (a way to read off interesting details of an individuals genotype based on a lot of reads) but rather for fingerprinting of suspects and surveillance targets against a large database of reference genomes.

link

atjj 5604 days ago

Burroughs-Wheeler transform is not just an academic idea, but is the basis for one of the standard tools for aligning short reads: http://bowtie-bio.sourceforge.net/index.shtml . There was a mention of compression earlier in the thread, but this is a red herring: BWT also used in bzip2's compression, but the use with DNA sequences is not to compress, but a pre-processing step on the haystack. As epistasis said, when you have billions needles, it makes sense to pre-process the haystack so that the per-needle cost is as low as possible.

As for the FBI's DNA database, this does not consist of sequence data, but, I believe, just microsatellite data. Even if they had full sequence data, it wouldn't make sense to search for new samples against the whole database, but to align the sample with a reference genome and then match the variations across a database of variations.

link

dasht 5604 days ago

I get that its the standard but see my comment above. If you can come close to I/O saturating inputing read data at only the cost of streaming the reference near I/O saturation, that's optimal -- and I don't think you need the hair of BWT to do that (and the BWT memory cost might even hurt you).

Its different for an imagined database where you are trying to align reads against a lot of personal genomes, e.g., to identify someone given a pretty complete genomic database of the usual suspects plus a few reads from the crime scene -- more around-the-corner "GATTACA" scenario than today's FBI, perhaps. There, perhaps, its economic to dedicate one server to ever N "usual suspects", loading the server with BWT of the suspect database, then stream reads out to the servers rather than loading servers with set of reads and streaming a single-human reference over that. (Of course, maybe you could just align the crime scene reads against a single human reference and then .... So even in GATTACA land use of BWT seems like dubious hair against a lightweight opponent like MAQ)

link

epistasis 5604 days ago

You have just been shown an amazing invention: the BWT. Rather than dismiss it and the entire field of performance obsessed people that find it more useful than any hash based algorithm, perhaps you should study it. The speculation in your comment is laughable, from a cluster costing double digit dollar difference over a single machine, to sequence data being used in forensic DNA. Sure, you throw around some introductory jargon that may fool an outsider into thinking you know what you're talking about, and that's why I'm bothering to comment so that no one is deceived.

BWT =~ suffix tree with far far greater memory efficiency

link

dasht 5604 days ago

Please relax a bit, geeze! For resequencing, you have to input all of the reads and this should be a significantly larger number of bytes than the number of bytes than the number of bytes in the unindexed reference. So let's look at the I/O costs of inputing the read data.

Essentially, any alignment technique that can come close enough to I/O saturating reading both the dominant factor (the reads) and the minor factor (the reference) is optimal.

When (for a slightly different kind of gapped, error-prone read) I wrote alignment software, I also beat MAQ by 10-20X, like the BWA algorithm but without using anything like the BWT hair. I contemplated how the gapping rules and error rules looked in a brute force NFA translation. I found simple ways to optimize that NFA very cheaply. I found that it paid to maximize read-data throughput by devoting all of memory to the NFA built from as many reads as possible. It was not hard, this way, to get in the ballpark (small enough constant factor) of I/O bounding read processing (where "not hard" means something like 3-6 months of staring at walls and scribbling stuff on paper until I found some solutions that were simple enough it is only bad luck I didn't find them right off the bat). Building a prefix or suffix tree or any other kind of index to the reference was the approach my boss suggested at the start and that I explicitly set out to beat (and did!). Streaming the reference is not the bottleneck if you can do it in an I/O bound way because (for resequencing, at least), the size of the read data is much larger and you can do alignment while coming within a small constant of I/O saturating the input of that read data using a simpler NFA approach.

Looking at the description of MAQ, besting its performance by any number of means is not so surprising. Reaching for the hair of using BWT does surprise me.

link

lvvlvv 5604 days ago

> Note that the worst case of complexity for this algorithm is much, much worse than the worst case complexity for Boyer Moore.

Can you explain your thought process for why worst case complexity is worse than other? Did you do any measurements? I believe you are incorrect. I am author of this page. I've just did quick test for SS = "aaa ...aaaBaaa...aaa", with SS_size=240. My algorithm is faster than 3 BMs out of 4 tested.

> Do not use this algorithm carelessly. For example, if you use it in a thoughtless way in your web server, you may open yourself to a DoS attack.

Same can be said about all BM, naive and BSD's memmem/strstr. Possibility of DOS for substring search algorithm was known for a long time - but it never materialized. The cure is trivial - limit substring size.

> That immediately made me think he might have devised it for genomics research.

No, it wasn't. It was devised when I was preparing for Google interview (which I failed). And of cause algorithms which pre-index haystack will always be faster. I myself do not consider haystack pre-indexing algorithms a "text search" algorithms (maybe incorrectly).

> gross exaggeration

If you count pre-indexing algorithms and edge cases - then you are correct. For most common case - can you show me something faster (from a student or even yourself)?

link

eru 5604 days ago

> Can you explain your thought process for why worst case complexity is worse than other? Did you do any measurements? I believe you are incorrect. I am author of this page. I've just did quick test for SS = "aaa ...aaaBaaa...aaa", with SS_size=240. My algorithm is faster than 3 BMs out of 4 tested.

You write on the page that your algorithm has "O(NM) worst case complexity." Compare Boyer-Moore's time complexity. The grandfather was interested in asymptotics. Not in some examples.

> Same can be said about all BM, naive and BSD's memmem/strstr.

Red herring.

> Possibility of DOS for substring search algorithm was known for a long time - but it never materialized.

What are you talking about?

> If you count pre-indexing algorithms and edge cases - then you are correct. For most common case - can you show me something faster (from a student or even yourself)?

What do you mean by egde cases? All the cases that make your program run slow?

Using your terminology, KMP has complexity O(N + M). That's better than O(MN).

link