| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by aerovistae 3276 days ago

The problem is this doesn't really do much to actually show how it works. It sort of shows what it does, but the utility of it isn't clear ("Why would I want to know if something was possibly found?") and the process by which it determines whether something is found is not really elucidated.

For those who are interested, this page does a really great job explaining it:

https://llimllib.github.io/bloomfilter-tutorial/

---------------

Nonetheless I applaud the author for making it anyway. Takes effort to build these things. Just offering a supplementary resource here.

My own note of explanation about how to use them: when a Bloom filter says something isn't present, it's guaranteed to be true. But a lot of the time it will instead basically say "I'm not sure," and it can never guarantee that something is definitely present. So the point is to use them in contexts where all that matters is to very quickly filter out as many "first occurrence" cases or "this doesn't exist" cases as possible.

That is to say, contexts where you want to identify things that you haven't seen before and which therefore aren't present in the dataset you're checking against, but for which it's okay to be wrong-- cases where being wrong slows you down a little or slightly reduces content quality rather than actually breaking anything. Bloom filters are about increasing quality or speed. Examples here~ https://en.wikipedia.org/wiki/Bloom_filter#Examples

In all the examples you can see how it's okay for the Bloom filter to be wrong. But it saves a lot of time by being right in the majority of cases.

My initial thought when I learned about them was "why not just use a set and be right 100% of the time?" And the answer, I believe, is that the set requires you to store all the items in memory in order to tell whether you've seen one before, which is expensive. A bloom filter is much, much, much smaller. A set is for when you need a 100% accuracy rate, a Bloom filter is not.

2 comments

mrspeaker 3276 days ago

It's especially confusing if, like me, you're only aware of bloom filters from post-processing shaders (https://github.com/cansik/processing-bloom-filter) and just keep typing words waiting for a pretty halo to appear around the text or something.

link

Arcsech 3276 days ago

Also, some databases, such as Cassandra[0] divide data into multiple files on disk which are sorted within the file but not between files, and use bloom filters to prune down the number of files which must be checked for a given key.

[0]: https://cassandra.apache.org/doc/latest/operating/bloom_filt...

link