Hacker News new | ask | show | jobs
by openasocket 3186 days ago
It leads to more false positives. Following the derivation from wikipedia (https://en.wikipedia.org/wiki/Bloom_filter), if we have a bloom filter with k hash functions, m bits, and contains n elements, the probability of a false positive is

(1 - (1 - 1/m)^(kn))^n ~ (1 - exp(-nk/m))^k

With your idea (let's call that Bloom Filter'), we have k hash tables with m/k bits each, so the probability of an element of one of the hash tables being 1 with n elements is 1 - (1 - k/m)^n, so the odds that 1 random position in each of the k hash sets is 1 (i.e. a false positive) is:

(1 - (1 - k/m)^n)^k ~ (1 - exp(-nm/k))^k

Which is very similar, we just switched k/m with m/k inside that exponential. So which has a lower false positive rate for m and k? We do some algebra, assuming that your Bloom Filter' has a lower false positive rate than the regular Bloom Filter, and try to get a contradiction:

(1 - exp(-nm/k))^k < (1 - exp(-nk/m))^k

1 - exp(-nm/k) < 1 - exp(-nk/m)

exp(-nk/m) < exp(-nm/k)

-nk/m < -nm/k

m/k < k/m

m^2 < k^2

m < k

This is a contradiction, because if we have fewer bits than hash functions, we'll wind up with hash tables of size 0. Thus, Bloom Filter' leads to a worse false positive rate than regular Bloom Filters.

P.S. I just did the math in the last 10 minutes, so there could be mistakes. This also only shows your system is less accurate if it has the same m and k as a regular bloom filter, but maybe your system becomes more accurate if using a different value of k. I'm checking that possibility now.

UPDATE: interesting. I tried to find the optimum value of k given n and m for bloom filter' (for regular bloom filters the optimum k = m / n log(2)). But for bloom filter' there is no local or global minimum, only a maximum at k = m (where everything is a false positive). For k < m the false positive value is decreasing, so the optimum under those constraints is k = 1. Which is the same as a regular bloom filter with just 1 hash. Thus, the bloom filter' will never outperform a regular bloom filter with optimal k. The false positive probability for bloom filter' for optimal k=1 is:

1 - (1 - 1/m)^n ~ 1 - exp(-n/m)