Hacker News new | ask | show | jobs
by snissn 3488 days ago
I've been running similar algorithms in production for around 4 years. One issue we've never solved wonderfully are unit tests on our bloom filter algorithms, we've used end to end tests to our satisfaction but not really tests of the bloom filter algorithms itself. Do you have some test / benchmark code that you could recommend for testing different bloom filter false positive rates?
2 comments

Google's Guava Bloom implementation is very well tested. I suggest looking at their guava test GitHub repo. They even use pre-seeded and manually verified filters as controls as some responses suggested.

Or you could just use Guava's Bloom ;)

As for probabilistic testing of fp rate... The problem is that every once in a while a test will fail.

Disclaimer: I wrote a cuckoo filter library.

If you want to test fp rate check my cuckoo filter test at sanityOverFillFilter() in github.com/MGunlogson/CuckooFilter4J/blob/master/src/test/java/com/github/mgunlogson/cuckoofilter4j/TestCuckooFilter.java

The unit test basically does fuzzing, filling the filter repeatedly in the hopes that one day any errors will surface. The error bounds are pretty large but small enough to detect any egregious failures. Importantly, my filter defaults to a random seed. Guava DOES NOT, so any tests using the same items will be deterministic. The guava filters use this property to verify some filters that have been manually determined to be correct

I'm thinking of building a test suite which takes a distribution of distinct elements and run a test for FPP and FNP. Then it asserts whether the the actual FPP and FNP is within an epsilon of the calculated probability.

I want to add statistics for mine to hopefully be able to monitor them at least by some estimate.

I would like to add to this that it's incredibly important you use fixed seed, or at least logged seed tests for this.

If you fail to do this you get unreproducable ghost tests that you can never investigate if they happen to fail rarely.