Hacker News new | ask | show | jobs
by foota 757 days ago
It seems like this would be most suitable for a system aggregating data. As long as you aggregate enough data points that the error averages out, it wouldn't be an issue.

I guess another use case could be as any kind of "hint" where you need to do an authoritative lookup regardless of the filter lookup.

E.g., the file might be on this host, but you'll need to reach the host and check for the file either way, so if you go to the wrong host sometimes, it's not the end of the world.

That's something that's not possible with a bloom filter.

Seems like you could combine a shared static file and a host local cache to work around the errors as well (e.g., each host can cache whatever keys they've looked up that were wrong, but they can do LRU to get the best of both worlds (frequently accessed data is correct, while you can look up infrequent data with some chance of a miss).

1 comments

I think those are both good examples of where you can manage the cost of a false positive.

In genomics, we're using this to map a DNA substring (or "k-mer") to a value. We can tolerate a very low error rate for those individual substrings, especially since any erroneous values will be random (vs. having the same or correlated values). So, with some simple threshold-based filtering, our false positive problem goes away.

Again, you'll never get the incorrect value for a key in the B-field, only for a key not in the B-field (which can return a false positive with a low, tunable error rate).

Yes, that makes sense.

Ooc, what do you think about the comparison to posting lists (aka bitmap indexes).

Some searching shows ML folks are also interested in compressing KV caches for models, so if your technique is applicable there you can probably find infinite funding :P

So a bitmap index requires a bit per unique value IIRC (plus the key and some amount of overhead). So for ~32 unique values you're already at 4 bytes, 40 bytes per key-value pair for 320 values, etc.

In comparison, a B-field will let you store 32 distinct values at ~3.4 bytes (27 bits) per key-value pair at a 0.1% FP rate.

Yes, I'm not claiming that they're going to perform as well, just that they're sort of similar in the space of problems.

There's different techniques used for compressing them, mostly around sorting the keys and e.g., run length encoding the values.