The problem is this doesn't really do much to actually show how it works. It sort of shows what it does, but the utility of it isn't clear ("Why would I want to know if something was possibly found?") and the process by which it determines whether something is found is not really elucidated.
For those who are interested, this page does a really great job explaining it:
Nonetheless I applaud the author for making it anyway. Takes effort to build these things. Just offering a supplementary resource here.
My own note of explanation about how to use them: when a Bloom filter says something isn't present, it's guaranteed to be true. But a lot of the time it will instead basically say "I'm not sure," and it can never guarantee that something is definitely present. So the point is to use them in contexts where all that matters is to very quickly filter out as many "first occurrence" cases or "this doesn't exist" cases as possible.
That is to say, contexts where you want to identify things that you haven't seen before and which therefore aren't present in the dataset you're checking against, but for which it's okay to be wrong-- cases where being wrong slows you down a little or slightly reduces content quality rather than actually breaking anything. Bloom filters are about increasing quality or speed. Examples here~ https://en.wikipedia.org/wiki/Bloom_filter#Examples
In all the examples you can see how it's okay for the Bloom filter to be wrong. But it saves a lot of time by being right in the majority of cases.
My initial thought when I learned about them was "why not just use a set and be right 100% of the time?" And the answer, I believe, is that the set requires you to store all the items in memory in order to tell whether you've seen one before, which is expensive. A bloom filter is much, much, much smaller. A set is for when you need a 100% accuracy rate, a Bloom filter is not.
It's especially confusing if, like me, you're only aware of bloom filters from post-processing shaders (https://github.com/cansik/processing-bloom-filter) and just keep typing words waiting for a pretty halo to appear around the text or something.
Also, some databases, such as Cassandra[0] divide data into multiple files on disk which are sorted within the file but not between files, and use bloom filters to prune down the number of files which must be checked for a given key.
This was the key to our data analytics url de-deuping platform back in 2011. We were pulling in 50k social media messages an hour and there were lots of duplicate links running though our pipeline. We had a 100GB bloom filter backed by Redis to keep a list of all links that came though our system and it worked beautifully.
I would guess that a 100GB bloom filter would have 800 gigabits.
You can make some guesses, with 7 hashes and a false probability of 1%, a bloom filter designed on a cardinality of 100 billion elements is a little over 100GB.
You can get lots of mileage out of bloom filters as well by just using them off of disk. If you're smart about it and use mmap'd files, you only ever end up reading a few bytes at a time. On modern SSDs you can get some really incredible performance not too far off using them in RAM, or you can use a bunch of them for a simple classifier.
For those who are interested, this page does a really great job explaining it:
https://llimllib.github.io/bloomfilter-tutorial/
---------------
Nonetheless I applaud the author for making it anyway. Takes effort to build these things. Just offering a supplementary resource here.
My own note of explanation about how to use them: when a Bloom filter says something isn't present, it's guaranteed to be true. But a lot of the time it will instead basically say "I'm not sure," and it can never guarantee that something is definitely present. So the point is to use them in contexts where all that matters is to very quickly filter out as many "first occurrence" cases or "this doesn't exist" cases as possible.
That is to say, contexts where you want to identify things that you haven't seen before and which therefore aren't present in the dataset you're checking against, but for which it's okay to be wrong-- cases where being wrong slows you down a little or slightly reduces content quality rather than actually breaking anything. Bloom filters are about increasing quality or speed. Examples here~ https://en.wikipedia.org/wiki/Bloom_filter#Examples
In all the examples you can see how it's okay for the Bloom filter to be wrong. But it saves a lot of time by being right in the majority of cases.
My initial thought when I learned about them was "why not just use a set and be right 100% of the time?" And the answer, I believe, is that the set requires you to store all the items in memory in order to tell whether you've seen one before, which is expensive. A bloom filter is much, much, much smaller. A set is for when you need a 100% accuracy rate, a Bloom filter is not.