Hacker News new | ask | show | jobs
by bobosha 408 days ago
what are some real-world usecases that people use it for?
9 comments

I like to think of it as a magical checklist, it helps us to quickly check if something _might_ be there, without actually looking through everything.

A few non-exhaustive real world use-cases that come to mind:

- Databases: To quickly check if a record might exist before doing a costly disk lookup.

- Spell Checkers: To check if a word might be in the dictionary.

- Spam Filters: To check if an email sender is on a list of known spammers.

- Browser Security: Chrome uses Bloom filters to check if a site might be malicious.

- Password Checker: To check if a password is known to be leaked.

- Web Caches: To check if a URL or resource is definitely not in the cache.

- Distributed Systems: To avoid sending data that another system definitely doesn’t need.

It’s big for caches. As your application grows in complexity you get dependent lookups. I need the user id to get the company id. I need the company id to figure out what features the user has access to. And then I need to run a bunch of queries to pull data for those feature.

And while using the cache may cut an order of magnitude off of the overall time, you’ve gone from hundreds of milliseconds to tens, but with a bloom filter you can figure out you have a cache miss faster and start the process of fetching the data sooner. The user may not notice the small improvement in response time, but by Little’s Law your cluster size can be smaller for the same traffic.

Web browsers use bloom filters to determine which CSS rules apply to which elements. IIRC Chrome removed a perf screen for CSS rules because most people were getting results below the noise floor for the timing function. The time to load the CSS was still relevant (maybe moreso due to the higher setup cost of the filters).

https://news.ycombinator.com/item?id=42486610

Discussion of how Bloom filters made SQLite much faster.

They are used somewhat commonly in bioinformatics where lookup tables can have long keys and many entries [1].

1. https://en.m.wikipedia.org/wiki/Bloom_filters_in_bioinformat...

There’s loads of high throughput data cases where asking “does this already exist?” is a high cost tradeoff and bloom filters answer it with minimal resources. If the cost of a “no” is 100x lower than “yes” then a 1% bloom filter breaks even. If it’s a 10000x then a bloom filter is 100x better.

One of the most interesting applications are rolling time indexed filters. If you have T filters representing the current time interval, you can retain D filters to represent a value being seen in a Duration. Each new I time segment you add 1 filter representing I/T time and test against all filters in D. A consecutive positive result for T filters gives a range of I-T existence for set membership and both a positive result and the time range that it was found within.

If you have a whole cluster of machines that have data on them and you need to ask: "does this machine probably have or not have this data?"; a bloom filter will tell you an answer. It can save a ton of time since a bloom filter's answer is "probably yes" and "definitely no."
Bloom filters can be extremely fast to tell you if something is not in a dataset.

They can give false positives incorrectly indicating an element might be in the set when it's not, but never false negatives

Knowing (fast) if something is not in a dataset can be very useful.

Initially, Bitcoin light wallets were built with bloom filters. So a full node would only propagate transactions, which satisfy a bloom filter, given by light client to that light client. The bloom filter seems not to be privacy preserving, that was one reason why this was abondend by some wallets. However, bitcoinj and wallets built on top of it, might still use this...
permission checks (allow/denylists)
Using them for whitelists is probably not a great idea because they can give false positives. An attacker could potentially flood the filter with fake accounts and increase the rate of false positives, increasing the chance they're granted access.

For blacklists, potentially more suitable, but since it can also give false positives, it could deny permission to people who should have it. An attacker might also attack this - by flooding the filter with accounts that deliberately get blacklisted, they could lock out people who should have access.

Obviously this is very use-case specific - it's probably not the best approach to doing permissions if security is paramount.

No, but they can tell you a user is definitely not in an allowlist or blocklist. That is useful, especially if it can save a database lookup on every check.
That may work, but there are potential issues with that regarding timing attacks. If an attacker could make many attempts to access a resource, they may be able to figure out who (probably) has access with a brute-force timing test, and narrow down an attack target.
I'm not sure I understand. Generally, an allow-list/block-list is for authorized resources? By the time you are doing this check, the user is already authenticated and this is part of authorization. So, the user shouldn't be able to authenticate as arbitrary users to do a timing attack. If they can, you have bigger problems.