| HN Mirror

# Why is CRLite able to compress so much data? Bloom filters are probabilistic data structures with an error rate due to data collisions. However, if you know the whole range of data that might be tested against the filter, you can compute all the false positives and build another layer to resolve those. Then you keep going until there are no more false positives. In practice, this happens in 25 to 30 layers, which results in substantial compression.

Mozilla worked with some researchers to analyze a bunch of degenerate cases like that and basically the answer is 'no', that the CRLite paper's calculation of an optimal false positive rate per layer works out quite well.

What does happen in CRLite is that you can't keep shipping the tiny "stash" updates to clients, you have to mint a whole new .mlbf filter file, which is about a megabyte. [Edit:] Then you can resume the "stash" updates from there, but the ecosystem 'shock' requires a regeneration of the filter.

(There was supposed to be a blogpost on the Mozilla blog from the research teams; I don't know if it was ever written.)