| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by cmurphycode 2600 days ago

In deduplication, we check to see whether a block has been seen before. If it has, we don't write it. If it hasn't, we write it down and record that information for next time. Checking if a block has been seen requires looking in an index, which is large and expensive to check, but recording the store can be amortized.

So, a bloom filter can tell you whether it's worth spending the expensive check to verify that the block is a duplicate, or whether you should skip that behavior and move on directly.

Same is true for malicious site detection. Google offers this feature in Chrome, but we don't want to make an expensive network request for every URL we visit. And we can't really have a database of all malicious URLs downloaded and updated to everyone's computer. Instead, they can distribute a bloom filter of malicious sites. if the bloom filter says it's malicious, we can afford to check the network for the final answer. If the bloom filter says it's not, we know we don't have to check (and that's the most common case, too!)

In my opinion, the intuition to extract is this. If you need to check membership, the set is expensive to check, and getting a definitive NO is valuable, then you can consider a bloom filter (or another probabilistic structure like a cuckoo filter). In dedup, we need to check membership, the index is big enough to not fit in RAM so it's expensive, and getting a definitive NO means we can skip the index check. In malicious site detection, we need to check membership, the answer requires a network round trip so it's expensive, and getting a definitive NO means we can just move on to loading the site.