Hacker News new | ask | show | jobs
by ahilss 5696 days ago
I recently used this algorithm for speeding up an iTunes style search function. Originally I did a strstr over every item in the database, but it wasn't quite fast enough. I precomputed a 32-bit mask - 1 bit per alphabetic char, 5 bits per digits, and the last bit for special characters - for every database item, and used that as an initial search filter. I only needed to use the more costly strstr on items that made it through this filter.
2 comments

That's pretty clever. It can be tempting in such a situation to immediately go for the big guns (patricia trees or whatever), but often a simple hack like that gives a lot more bang for the buck.
A Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter) might work well in this case.
Unlikely.

A Bloom filter is usually used where representation size is important, because it can be orders of magnitude smaller than a hashtable. It doesn't perform very quickly though, because of the hashing operations needed during insert.

A Bloom filter is great for P2P search applications for example. Then peers can pass Bloom filters around, and an initial search can happen locally. If it succeeds then the search can ask the remote host if the file actually exists. The frequency those remote requests will fail depends on the number of false positives the Bloom filter is configured to give (ie, a function of its size).

I haven’t tried a Bloom filter, but I think for my application, the simple bit array might give better performance. The nice thing about the bit array is it works great on the worst-case. After the user has typed in 3 or 4 characters of their query, the search space has been narrowed enough where the performance isn't an issue. It’s that first or second character when the search space is large that gives problems. Using the bit array in the single character case, the 1 to 1 mapping of characters to bits gives the result directly.