Hacker News new | ask | show | jobs
by andrewla 761 days ago
Depending on the type of string search this is not completely true. A naive string search is O(kn) (where k is the length of the search string) which is actually slower than the O(n) that a DFA can get.

But KMP and similar algorithms can do better; they can get performance approaching O(n/k) for many cases by not even looking at every character of the input string.

2 comments

KMP always looks at every position, but it is linear.

There is a variant of Boyer Moore which is sublinear and linear in the worst case, but it's not that fast by modern standards.

The fastest sub linear search algorithms I know that remain linear in the worst case are Linear Weak Factor Recognition (LWFR) and Linear Hash Chain (LHC). LHC is currently unpublished (I'm writing a paper on it right now).

In practice, it's apparently often better to avoid these techniques and just use SIMD operations to search. You look at every byte, so it's O(n) but the overhead of more complicated approaches isn't worth it. See this comment from burntsushi: https://lobste.rs/s/ycydmd/why_gnu_grep_is_fast_2010#c_gpim7....
For short patterns this is probably right. Longer patterns will be faster using a modern sublinear search algorithm.
As the author of ripgrep, I wouldn't necessarily buy this. I suppose I might agree with it in the extremes, but SIMD prefilters are quite exceptionally difficult to beat with scalar code in the common cases. Longer patterns are great for the SIMD algorithms in ripgrep because of its use of background frequency distribution heuristics. That is, the longer the pattern, the less likely your candidate detection is to produce a false positive in its hot path. (I say "less likely" here intentionally. One can of course pick specific needles and haystacks where a false positive is reported at every position for any length N.)

I don't mean to 100% disagree with you, but I think it's misleading to suggest a sort of one dimensional view of things where, as the pattern gets larger, SIMD gets worse compared to sublinear search algorithms. There are other factors at play here, and, importantly, what "long" means in this context.

In many practical circumstances, "short" might be a few bytes, while "long" is 16 bytes. But maybe your idea of "long" is actually much longer.

If you're curious how your own algorithm stacks up to ripgrep's, you can plug your implementation into the `memchr` crate's benchmark harness: https://github.com/BurntSushi/memchr

It uses rebar: https://github.com/BurntSushi/rebar

Interesting, thanks for the detailed response. I'll have a look at the benchmark; I'm doing some work on algorithm benchmarking right now by coincidence.

I'd say a long pattern is more like 64 bytes (the benchmarking suite I use defines short patterns as 32 or under).

Edit: will also check out the frequency approach used in ripgrep, sounds fascinating.

You can read more about it at various points in these two blogs I've written:

https://blog.burntsushi.net/ripgrep/

https://blog.burntsushi.net/regex-internals/

64 bytes is decently long, but I'd still put my money on SIMD for "common" cases.