Show HN: Bloom Filter

Y	Hacker News new \| ask \| show \| jobs

	Show HN: Bloom Filter (sacert.github.io)
	65 points by sacert 3276 days ago

7 comments

aerovistae 3276 days ago

The problem is this doesn't really do much to actually show how it works. It sort of shows what it does, but the utility of it isn't clear ("Why would I want to know if something was possibly found?") and the process by which it determines whether something is found is not really elucidated.

For those who are interested, this page does a really great job explaining it:

https://llimllib.github.io/bloomfilter-tutorial/

---------------

Nonetheless I applaud the author for making it anyway. Takes effort to build these things. Just offering a supplementary resource here.

My own note of explanation about how to use them: when a Bloom filter says something isn't present, it's guaranteed to be true. But a lot of the time it will instead basically say "I'm not sure," and it can never guarantee that something is definitely present. So the point is to use them in contexts where all that matters is to very quickly filter out as many "first occurrence" cases or "this doesn't exist" cases as possible.

That is to say, contexts where you want to identify things that you haven't seen before and which therefore aren't present in the dataset you're checking against, but for which it's okay to be wrong-- cases where being wrong slows you down a little or slightly reduces content quality rather than actually breaking anything. Bloom filters are about increasing quality or speed. Examples here~ https://en.wikipedia.org/wiki/Bloom_filter#Examples

In all the examples you can see how it's okay for the Bloom filter to be wrong. But it saves a lot of time by being right in the majority of cases.

My initial thought when I learned about them was "why not just use a set and be right 100% of the time?" And the answer, I believe, is that the set requires you to store all the items in memory in order to tell whether you've seen one before, which is expensive. A bloom filter is much, much, much smaller. A set is for when you need a 100% accuracy rate, a Bloom filter is not.

link

mrspeaker 3276 days ago

It's especially confusing if, like me, you're only aware of bloom filters from post-processing shaders (https://github.com/cansik/processing-bloom-filter) and just keep typing words waiting for a pretty halo to appear around the text or something.

link

Arcsech 3275 days ago

Also, some databases, such as Cassandra[0] divide data into multiple files on disk which are sorted within the file but not between files, and use bloom filters to prune down the number of files which must be checked for a given key.

[0]: https://cassandra.apache.org/doc/latest/operating/bloom_filt...

link

orthoganol 3276 days ago

With Cuckoo filters visualized too:

https://bdupras.github.io/filter-tutorial/

link

joeblau 3276 days ago

This was the key to our data analytics url de-deuping platform back in 2011. We were pulling in 50k social media messages an hour and there were lots of duplicate links running though our pipeline. We had a 100GB bloom filter backed by Redis to keep a list of all links that came though our system and it worked beautifully.

link

cjhanks 3275 days ago

A 100GB bloom filter? What does that mean? How many hash functions.. how many bits?

link

bane 3275 days ago

I would guess that a 100GB bloom filter would have 800 gigabits.

You can make some guesses, with 7 hashes and a false probability of 1%, a bloom filter designed on a cardinality of 100 billion elements is a little over 100GB.

link

elliottcarlson 3276 days ago

The "not found" message does not clear, so repeated not founds repeat the message.

link

sacert 3276 days ago

Oops - accidentally appended in the wrong spot. Fixed, thanks!

link

bane 3275 days ago

You can get lots of mileage out of bloom filters as well by just using them off of disk. If you're smart about it and use mmap'd files, you only ever end up reading a few bytes at a time. On modern SSDs you can get some really incredible performance not too far off using them in RAM, or you can use a bunch of them for a simple classifier.

here's an implementation: https://axiak.github.io/pybloomfiltermmap/

link

sacert 3276 days ago

For anyone interested, the source code is here: https://github.com/sacert/Bloom-Filter

link

ne0free 3276 days ago

I add qwerty, fire, water, qwerty, land, air, bottle, laptop all this I search for fig and figure and it does mess..

link