|
Interesting. Note that the worst case of complexity for this algorithm is much, much worse than the worst case complexity for Boyer Moore. Do not use this algorithm carelessly. For example, if you use it in a thoughtless way in your web server, you may open yourself to a DoS attack. Note that the author nicely characterizes it as of potential use for small alphabets and possibly multiple substrings (in a single search). That immediately made me think he might have devised it for genomics research. In most applications I would think you'd also want regexp features. Interestingly, DNA research and use in a regexp engine is something he goes on to suggest. (If you are searching for a very large number of regexps in a big genome database, I would not use this algorithm. I found that some simple variants on classic NFA techniques work very well for a wide class of typical regexps (e.g., regexps modeling SNPs, small read position errors, small numbers of read errors, etc. There probably isn't any one obviously right answer, though, and a lot depends on your particular hardware situation, data set sizes, etc.). The HN headline is very bogus hype. "X2 times faster than Boyer-Moore" is far from true in the general case. "breakthrough" is a gross exaggeration: this is a technique that anyone with some good algorithms course or two under the belt should be able to think of an, for most applications, decide to not use because of the limitations of the thing. I can definitely see it being nice for some applications tolerant of its limitations but... breakthrough it ain't. |
See http://en.wikipedia.org/wiki/Burrows–Wheeler_transform or http://bioinformatics.oxfordjournals.org/content/early/2009/...