Hacker News new | ask | show | jobs
by ianamartin 2158 days ago
I don't know enough about bloom filters or the math involved to say whether or not you are correct.

But what you're saying sounds a lot like many proponents of various flavors of NoSQL, Eventual Consistency, etc. I.e., it sounds like you are saying that lots of people have been using it for a long time, and it's just good enough. The actual correctness isn't really that big a deal.

I might be crossing domain boundaries here and thinking up stuff that really doesn't matter. I mean, bloom filters have a known challenge. No one was ever supposed to use them except for statistical purposes anyway, and with a known error rate, it's fine to add into your models.

But there's also something that feels a little weird about your point. It sounds like you're saying it's good enough for X, Y, and Z use cases; therefore it doesn't really matter how technically wrong it is.

But again, I could be really off.

3 comments

Your comparison isn't really appropriate. Bloom filters work, they worked before this proof and they work the same way after the proof. This result isn't about how correct Bloom filters are or aren't. Their caveats are understood, and even using the "bad" math from the original paper, their performance in practical scenarios could be estimated with enough accuracy to use them in practical applications. I dare say not a single user of bloom filters will change anything about their implementation stemming from this result.

For example, if you size a bloom filter a certain way, the "bad" math might tell you your false positive rate is 0.001%, while the "good" math might tell you your false positive rate is 0.001002%. It makes no difference. The error is orders of magnitude smaller than the number you get anyway. (I made those numbers up, but I've used Bloom filters and they should be in the ballpark for the sizes I've worked with). The bad math might be strictly speaking incorrect, but it's a good enough approximation for all practical purposes.

This is different from Eventual Consistency stuff, which has real practical implications from not having certain guarantees. Those limitations are real, and they have real consequences, not just a rounding error in a number.

The difference between approximate and exact false positive rate calculations in real-world scenarios is insignificant. It's fine to use the approximation.

If you had very specific tolerances here you wouldn't work in terms of the expected false positive rate anyway; you'd build your filter and validate on the real thing.

A footnote in the article notes that the original "faulty" formula and the new "correct" one are asymptotically equivalent.