Hacker News new | ask | show | jobs
by bumbledraven 2248 days ago
What is a minimal test case (needle + haystack) for which the old code yields a different result than the new code?
2 comments

The smallest test case appears to be a needle of "aaa" and a haystack of "xxaaa", where VS 2019 16.5's boyer_moore_searcher will report "not found", while the corrected code finds the needle at offset 2 in the haystack.
Seems astonishing that nobody has run into this in the wild. How does it do on finding ‘www’ in ‘http://www.example.com/, for example?
I conclude that usage of std::boyer_moore_searcher is relatively low, despite this C++17 feature shipping in VS 2017 15.3 (August 2017).

"www" triggers the bug like all 3-character repeats, but Boyer-Moore manages to find it in "http://www.example.com/" despite the damaged delta2 table, because the incorrect shift value is never exercised. However, a haystack of "https://www.example.com/" triggers the bug, because now the "www" is at an offset of 8.

You need a needle that triggers the problem, AND a haystack where the delta2 table matters, AND the algorithm to stop at a problematic index where it uses a bad delta2 value.

There's a reason the published version even in 'the papers' was wrong for 3 years.

There are a bunch of test cases here:

https://github.com/microsoft/STL/pull/724/commits/a7da5cce8c...

Looks like it can happen with quite small needles, eg. "aa".

After comparing the outputs of the old and new algorithms, I believe that a needle needs more than two repeats in order to trigger the bug. That is, "aa" and "abab" don't trigger the bug, but "aaa" and "ababa" do, the last one having a partial third repeat (needle "ababa" with haystack "WXYZababa" was definitely wrong). I tested the table output for "aa" for completeness (and just in case we manage to damage it in the future).

Disclaimer: I don't understand the algorithms deeply enough to write them from scratch, and I'm writing this at 6 AM.