Hacker News new | ask | show | jobs
by matt4711 5605 days ago
Pattern matching performance also depends on the alphabet size of the text. In his experiment he doesn't report the alphabet size of the text nor does he provide results for different text collections.

The algorithm itself looks very similar to the one used in agrep proposed by Wu and Manber [1].

I also found the book "Flexible Pattern Matching in Strings" to be a very good reference on all things related to pattern matching [2].

[1] S. Wu and U. Manber. A fast algorithm for multi-pattern searching. Report TR-94-17, Department of Computer Science, University of Arizona, 1994.

[2] http://www.amazon.com/Flexible-Pattern-Matching-Strings-Line...

1 comments

He talks a bit about how to pick the right number of successive letters to use as hash keys - which is where you can get a handle on alphabet sizes. I would guess (maybe it actually says) that he Wikipedia dump in the benchmark was UTF-8 or ASCII and, either way, treated as an alphabet of 8-bit characters. The DNA case is kind of interesting (2 bits min but more likely 3 or 4 in a typical genome record).
An illustration from the book I cited above showing the importance of the alphabet size (y-axis) and the pattern length (x-axis):

http://i.imgur.com/KGOZW.jpg

In the experiment he used patterns of different length on the same text collection. As you can see in the graph, different algorithms perform best for a certain alphabet size.

He describes the text collection as "text corpus taken from wikipedia text dump" so I'm guessing the alphabet size is around 90?

It's also probably not a good thing that all the strings he is searching for are prefixes of the same pattern.

References:

Shift-Or: http://www-igm.univ-mlv.fr/~lecroq/string/node6.html#SECTION...

BNDM: http://www-igm.univ-mlv.fr/~lecroq/string/bndm.html#SECTION0...

BOM: http://www-igm.univ-mlv.fr/~lecroq/string/bom.html#SECTION00...