|
|
|
|
|
by johnboyer
2876 days ago
|
|
I would most definitely be interested. I always assumed that certain algorithms are better suited for certain string sizes, but I was never sure which ones. The ideal implementation is probably combining multiple algorithms with ranges of the string size. |
|
1. ShiftOr for short strings. Easy to implement.
This algorithm is not sub-linear like Boyer Moore - it examines every position, but it uses bit-parallelism to validate a match, making it outperform algorithms that rely on jumping then separate verification stages, for short strings.
2. Variants of Wu-Manber for longer strings. Hard to find a good description, but not too hard to implement.
Wu-Manber is a search algorithm designed for multi-pattern searching, based on Boyer-Moore-Horspool and hashes. However, it also performs really well for single-pattern searching. I have encountered variants of this in various places, e.g. Lecroq's in "Fast Exact String Matching Algorithms".
These algorithms use a hash of a q-gram to look up how far it's safe to shift. Q-grams tend to appear less frequently in search patterns vs. single characters, so you get longer jumps, at the cost of reading more characters and producing a hash.
3. Horspool (or Boyer-Moore-Horspool).
This algorithm performs quite well - not as well as ShiftOr for shorter patterns or Wu-Manber variants for longer ones, but still respectable. It's essentially Boyer-Moore but only using one of the shift tables, which makes it much easier to implement.
4. Qgram-Filtering by Branislav Durian, Hannu Peltola, Leena Salmela and Jorma Tarhio
For longer patterns, this algorithm outperforms the others mostly. However, it can be a bit complicated to implement well, and it has some nasty worst-cases (rarely encountered) where the performance becomes absolutely dreadful. For that reason I tend not to use it.
There are hundreds of possible search algorithms available now (see https://github.com/smart-tool/smart for implementations of many of them with a benchmark tool). However, it's hard to figure out exactly which algorithm is the best given your data and pattern. For that reason, I tend to keep things simple.
I would just use ShiftOr for short patterns, and another one for longer patterns. I would tend to use a Wu-Manber variant there, but Horspool would probably give acceptable performance.
The only other consideration is the time it takes to build the pattern indexes. For short searches, or if you have to re-build the pattern index on each repeated search, it can actually be quicker to just do a naive "check the string at each position" search, since it doesn't require building anything before searching.