How does a Bloom filter with k hash functions hashing to a shared table of m bits compare to just using k hash functions each hashing into its own separate hash table of m/k bits?
Having many filters is worse, but not hugely so. I believe the problem is reducible to the fact that you'll have many smaller filters thus the random distribution hurts you a bit more.
We use blocked bloom filters for a couple of reasons, but one major benefit is the memory locality (our "bloom filter" is 32GB or larger, so it's handy & fast to be able to address it with separate "pages" which are really just individual bloom filters.)
It leads to more false positives. Following the derivation from wikipedia (https://en.wikipedia.org/wiki/Bloom_filter), if we have a bloom filter with k hash functions, m bits, and contains n elements, the probability of a false positive is
(1 - (1 - 1/m)^(kn))^n ~ (1 - exp(-nk/m))^k
With your idea (let's call that Bloom Filter'), we have k hash tables with m/k bits each, so the probability of an element of one of the hash tables being 1 with n elements is 1 - (1 - k/m)^n, so the odds that 1 random position in each of the k hash sets is 1 (i.e. a false positive) is:
(1 - (1 - k/m)^n)^k ~ (1 - exp(-nm/k))^k
Which is very similar, we just switched k/m with m/k inside that exponential. So which has a lower false positive rate for m and k? We do some algebra, assuming that your Bloom Filter' has a lower false positive rate than the regular Bloom Filter, and try to get a contradiction:
(1 - exp(-nm/k))^k < (1 - exp(-nk/m))^k
1 - exp(-nm/k) < 1 - exp(-nk/m)
exp(-nk/m) < exp(-nm/k)
-nk/m < -nm/k
m/k < k/m
m^2 < k^2
m < k
This is a contradiction, because if we have fewer bits than hash functions, we'll wind up with hash tables of size 0. Thus, Bloom Filter' leads to a worse false positive rate than regular Bloom Filters.
P.S. I just did the math in the last 10 minutes, so there could be mistakes. This also only shows your system is less accurate if it has the same m and k as a regular bloom filter, but maybe your system becomes more accurate if using a different value of k. I'm checking that possibility now.
UPDATE: interesting. I tried to find the optimum value of k given n and m for bloom filter' (for regular bloom filters the optimum k = m / n log(2)). But for bloom filter' there is no local or global minimum, only a maximum at k = m (where everything is a false positive). For k < m the false positive value is decreasing, so the optimum under those constraints is k = 1. Which is the same as a regular bloom filter with just 1 hash. Thus, the bloom filter' will never outperform a regular bloom filter with optimal k. The false positive probability for bloom filter' for optimal k=1 is:
The key word to google is "blocked bloom filter" e.g. as proposed in http://algo2.iti.kit.edu/documents/cacheefficientbloomfilter...
Here's a nice paper with some improvements http://tfk.mit.edu/pdf/bloom.pdf
We use blocked bloom filters for a couple of reasons, but one major benefit is the memory locality (our "bloom filter" is 32GB or larger, so it's handy & fast to be able to address it with separate "pages" which are really just individual bloom filters.)