Hacker News new | ask | show | jobs
by bane 3462 days ago
If you find yourself searching lots of haystacks, and your needles are just text and not a regex, a better approach is to stuff all the needles into some kind of index, then chop up the haystack into overlapping tiles (of variable width from the smallest needle to the largest), then search each tile against the index of needles. This effectively searches all the needles at once and turns the operation from O(n) where n is the number of needles to O(m) where m is the number of tiles in haystack.txt.

It may seem to be a trivial difference, but then you can search multiple haystacks at once fairly easily, and this approach scales to hundreds of millions of needles at once. The code for it isn't very difficult either, heck you can just use an in memory SQLite dB to get a searchable, temporary, index and rely on using some of the most tested software in history.

1 comments

This also works for sorting as well and is typically called Radix Sort or Bucket Sort.

Basically using unique attributes of the data you then divide and conquer on those attributes (e.g. for Radix you make buckets based on the digits).