| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by armon 3626 days ago

I've given a talk at Papers We Love about Bloom Filters (http://paperswelove.org/2015/video/armon-dadgar-on-bloom-fil...) for those wanting to learn more. There have been quite a few follow on extensions to standard bloom filters that address some of the shortcomings mentioned here.

For example, the cost of having K hash functions can be avoided with some simple math. This is discussed in "Less Hashing, Same Performance" (http://www.eecs.harvard.edu/%7Ekirsch/pubs/bbbf/esa06.pdf). You can also avoid the FPP rate from saturating at 100% using a technique called "Scalable Bloom Filters" (http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf). The basic idea is you start with a small BF and layer on progressively larger filters. This also reduces the memory you need to allocate up front without worrying about the FPP rate saturating. As others have mentioned, Counting Bloom Filters allow for counts and deletion as well. The basic point is that there has been a _lot_ of additional work on bloom filters. One of my favorites is "Adaptive Range Filters" (http://www.vldb.org/pvldb/vol6/p1714-kossmann.pdf).

At an advertising company I previously worked for, we relied on bloom filters and hyperloglogs quite heavily for our realtime analytics pipeline. We open sourced our C daemons (https://github.com/armon/bloomd and https://github.com/armon/hlld) which have been running in production for several years.