Hacker News
by StavrosK 4579 days ago
This got me thinking: the obvious way to do full-text search over post bodies would be to build an inverted index that maps each word to the posts it appears in, i.e. {"word": [post_id, post_id, post_id]}. However, this could be huge.
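
For reference, here is a minimal sketch of that inverted index. The sample posts, the function names, and the whitespace tokenizer are assumptions made for illustration, not anything taken from the original post.

```python
from collections import defaultdict


def build_inverted_index(posts):
    """Map every word to the set of post IDs whose body contains it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in text.lower().split():
            index[word].add(post_id)
    return index


def search(index, query):
    """Return the IDs of posts that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the posts matching the first term, then intersect.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample data.
posts = {
    1: "bloom filters save space",
    2: "full text search with an inverted index",
    3: "a bloom filter search engine",
}
index = build_inverted_index(posts)
matches = search(index, "bloom search")
```

The index holds one entry per distinct word, each with a list of post IDs, which is where the size concern comes from.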

Does anyone know if there's a technique that keeps a list of posts and, for each post, a Bloom filter containing all the words in that post? I.e. you iterate over all the posts and check each post's Bloom filter for membership of every search term.

Since the number of words is (probably) much greater than the number of posts, you only need to loop once per post (a few hundred times at most), and you gain a lot in saved space. Also, since a Bloom filter has no false negatives, you are at least guaranteed to find all the posts that mention the specified words (with maybe a few "junk" ones in between, which should be easy for the reader to filter out).
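
A rough sketch of the idea, assuming a hand-rolled Bloom filter stored in a Python int, SHA-256 with double hashing to pick bit positions, and 16384 bits (2 KB) with 4 hash functions per filter. All of these are illustrative choices, not the parameters of the linked proof of concept.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter that uses a Python int as its bit array."""

    def __init__(self, num_bits=16384, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions from one SHA-256 digest
        # using double hashing: h1 + i * h2.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def build_index(posts):
    """Map each post ID to a Bloom filter holding the words of that post."""
    index = {}
    for post_id, text in posts.items():
        bloom = BloomFilter()
        for word in text.lower().split():
            bloom.add(word)
        index[post_id] = bloom
    return index


def search(index, query):
    """Return IDs of posts whose filter reports every query term as present."""
    terms = query.lower().split()
    return [pid for pid, bloom in index.items() if all(t in bloom for t in terms)]


# Hypothetical sample data.
posts = {
    1: "bloom filters save space",
    2: "full text search with an inverted index",
    3: "a bloom filter search engine",
}
index = build_index(posts)
results = search(index, "bloom search")
```

Post 3 is always in `results` because there are no false negatives. Other posts may occasionally show up as false positives, depending on how full the filters are.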

You can't do weighting with this technique, but it should at least be a quick way to figure out which post IDs you want to show.

Does anything like this exist currently?

EDIT: Here's a quick proof of concept: http://nbviewer.ipython.org/gist/skorokithakis/d115ab734d9ad...

It works fine, but the filters are a bit large (2 KB each), so I'm not sure how much space you save.
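
To get a feel for the size trade-off, the standard Bloom filter sizing approximations can be plugged in: m = -n ln(p) / (ln 2)^2 bits for n items at false-positive rate p, with k = (m / n) ln 2 hash functions. The sketch below uses those textbook formulas. The word counts are made-up inputs, not measurements from the notebook.

```python
import math


def optimal_bloom_parameters(num_items, false_positive_rate):
    """Return (bits, hash count) from the usual sizing approximation."""
    ln2 = math.log(2)
    num_bits = math.ceil(-num_items * math.log(false_positive_rate) / ln2 ** 2)
    num_hashes = max(1, round(num_bits / num_items * ln2))
    return num_bits, num_hashes


def expected_false_positive_rate(num_bits, num_items, num_hashes):
    """Approximate false-positive rate: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


FILTER_BITS = 2 * 1024 * 8  # a 2 KB filter, as in the post

# Hypothetical numbers of distinct words per post.
for distinct_words in (100, 500, 2000):
    k = max(1, round(FILTER_BITS / distinct_words * math.log(2)))
    rate = expected_false_positive_rate(FILTER_BITS, distinct_words, k)
    print(f"{distinct_words} words, {k} hashes: ~{rate:.2e} false-positive rate")
```

If posts are short, a 2 KB filter is probably larger than it needs to be for a tolerable error rate, so shrinking it is one way to recover space.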

EDIT 2: This was so much fun that I wrote it up: http://www.stavros.io/posts/bloom-filter-search-engine/

1 comments

Very nice. You will save some space and allow for some typos if you stem and Soundex the words before insertion. You can also save space and improve the run time somewhat if, rather than many separate Bloom filters, you build one large one where each item is post ID + word. If you do that, you can also insert each word bare so that you get O(1) empty result sets, which is helpful if you're updating the results with every keystroke in a search box.
Huh, very nice idea! That should indeed save a ton of space and be much simpler when searching! I'll try that now, thank you.
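
A sketch of the single-filter variant, under the same assumptions as before: a hand-rolled filter, SHA-256 double hashing, and an arbitrary size of 2^18 bits. The "post_id:word" key format is a hypothetical encoding of "post ID + word".

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter that uses a Python int as its bit array."""

    def __init__(self, num_bits=1 << 18, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def build_combined_filter(posts):
    """Insert every word twice: bare, and prefixed with its post ID."""
    bloom = BloomFilter()
    for post_id, text in posts.items():
        for word in text.lower().split():
            bloom.add(word)                 # bare word, for the early exit
            bloom.add(f"{post_id}:{word}")  # post ID + word
    return bloom


def search(bloom, post_ids, query):
    terms = query.lower().split()
    # If any bare term is missing, no post can match, so skip the loop.
    if any(term not in bloom for term in terms):
        return []
    return [pid for pid in post_ids
            if all(f"{pid}:{term}" in bloom for term in terms)]


# Hypothetical sample data.
posts = {
    1: "bloom filters save space",
    2: "full text search with an inverted index",
    3: "a bloom filter search engine",
}
bloom = build_combined_filter(posts)
results = search(bloom, posts.keys(), "bloom search")
empty = search(bloom, posts.keys(), "zebra")
```

The bare-word check costs the same no matter how many posts there are, which is what makes the empty-result case cheap.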

EDIT: Hmm, turns out it's pretty much the same size, which makes sense, I guess: http://nbviewer.ipython.org/gist/skorokithakis/0abbfebced25f...

The space savings for the same error rate should be small (I think the likelihood of false positives for a given load goes down slightly with the size of the filter), but the benefit in lookup time should be significant for multiword searches. Thinking about it more, though, if you're doing live search you'll already have computed the results for the first word by the time you are given a second, so maybe it doesn't matter.
I think it'll be faster to do multiple filters, because the one-filter way requires hashing and comparing N times (once per post, since each key includes the post ID), while the multiple-filter way requires hashing once and comparing N times.
Oops, you are correct.
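
To illustrate the hash-once point: if every per-post filter uses the same size and the same hash functions (an assumption of this sketch), the query terms can be hashed a single time into a bit mask, and each post then costs one bitwise comparison. The names and parameters below are illustrative only.

```python
import hashlib

NUM_BITS = 16384  # every per-post filter must share these parameters
NUM_HASHES = 4


def positions(word):
    """Bit positions for a word, derived from one SHA-256 digest."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % NUM_BITS for i in range(NUM_HASHES)]


def make_mask(words):
    """OR together the bits of all words; this is also how a filter is built."""
    mask = 0
    for word in words:
        for pos in positions(word):
            mask |= 1 << pos
    return mask


def search(filters, query):
    # Hash the query terms once...
    query_mask = make_mask(query.lower().split())
    # ...then do one cheap comparison per post, with no further hashing.
    return [pid for pid, bits in filters.items()
            if (bits & query_mask) == query_mask]


# Hypothetical sample data.
posts = {
    1: "bloom filters save space",
    2: "full text search with an inverted index",
    3: "a bloom filter search engine",
}
filters = {pid: make_mask(text.lower().split()) for pid, text in posts.items()}
results = search(filters, "bloom search")
```

With the single combined filter, the key changes for every post, so the hashing has to be repeated for each one.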