Hacker News new | ask | show | jobs
by aidos 4615 days ago
Interesting. I'd never looked into how BFs work before and this is a nice clear explaination.

One thing I don't quite get - as you're hashing into the same bit array for each of the hashes - you must get false positives from combinations of hashes. So say you chose a bad combination of hash algorithms where one key hashed into bits a and b using the two different hashes respectively. Maybe some other key might hash into b and a. With a totally suboptimal choice of hash algorithms you could end up with a 50% error rate. Or am I misunderstanding something?

2 comments

This is kind of a fundamental part of bloom filters. They're probabilistic data structures, which is the trade you make for space efficiency.

If you get a false back, you can guarantee that some item is not in the filter. If you get a true back, you can be fairly certain it is. Once you have a degree of certainty, you can do a more thorough check for membership while cutting out large numbers of negatives.

I get the concept - you're trading space for certainty. It's more the bit where it says:

    "The more hash functions you have, the slower your bloom filter, and the quicker it fills up. *If you have too few, however, you may suffer too many false positives.*"
Just made me wonder - couldn't badly chosen hash functions actually give you more false positives?
I agree with what chamblin said, and to add one more point to that, you can choose the amount of uncertainty by using a larger or smaller bit array relative to the number of items you will have.