Hacker News new | ask | show | jobs
by ComputerGuru 757 days ago
Curious idea. So it’s for cases where you have any key but associated with one of only (preferably few) discrete values. I.E. your url example is great with url as a key but subpar if url were to be the value (padded to length n with trailing nulls encoded as a fixed width int array)?

With its interesting set of guarantees, I can’t see a case where you could use this unless you are 100% positive all keys have previously been inserted into the set, otherwise you risk getting a wrong value in return (instead of no value). A traditional bloom filter is similar but in the worst case you throw away work because you look up the determinative data/value but here it’s a bit trickier.

Lots of applications tolerate missing results but significantly fewer can tolerate “unknowingly incorrect” results.

Question about the implementation: I would have expected the primary interface to be in-memory with some api for disk spillover for large datasets but while all the docs say “designed for in-memory lookups” the rust api shows that you need to provide it with a temp directory to create the structure? (Also, fyi, you use temp::temp_file() but never actually use the result, instead using the hard-coded /tmp path.)

3 comments

It seems like this would be most suitable for a system aggregating data. As long as you aggregate enough data points that the error averages out, it wouldn't be an issue.

I guess another use case could be as any kind of "hint" where you need to do an authoritative lookup regardless of the filter lookup.

E.g., the file might be on this host, but you'll need to reach the host and check for the file either way, so if you go to the wrong host sometimes, it's not the end of the world.

That's something that's not possible with a bloom filter.

Seems like you could combine a shared static file and a host local cache to work around the errors as well (e.g., each host can cache whatever keys they've looked up that were wrong, but they can do LRU to get the best of both worlds (frequently accessed data is correct, while you can look up infrequent data with some chance of a miss).

I think those are both good examples of where you can manage the cost of a false positive.

In genomics, we're using this to map a DNA substring (or "k-mer") to a value. We can tolerate a very low error rate for those individual substrings, especially since any erroneous values will be random (vs. having the same or correlated values). So, with some simple threshold-based filtering, our false positive problem goes away.

Again, you'll never get the incorrect value for a key in the B-field, only for a key not in the B-field (which can return a false positive with a low, tunable error rate).

Yes, that makes sense.

Ooc, what do you think about the comparison to posting lists (aka bitmap indexes).

Some searching shows ML folks are also interested in compressing KV caches for models, so if your technique is applicable there you can probably find infinite funding :P

So a bitmap index requires a bit per unique value IIRC (plus the key and some amount of overhead). So for ~32 unique values you're already at 4 bytes, 40 bytes per key-value pair for 320 values, etc.

In comparison, a B-field will let you store 32 distinct values at ~3.4 bytes (27 bits) per key-value pair at a 0.1% FP rate.

Yes, I'm not claiming that they're going to perform as well, just that they're sort of similar in the space of problems.

There's different techniques used for compressing them, mostly around sorting the keys and e.g., run length encoding the values.

> So it’s for cases where you have any key but associated with one of only (preferably few) discrete values

We use it for a case with ~million unique values, but it's certainly more space efficient for cases where you have tens, hundreds, or thousands of values. The "Space Requirements" section has a few examples: https://github.com/onecodex/rust-bfield?tab=readme-ov-file#s... (e.g., you can store a key-value pair with 32 distinct values in ~27 bits of space at a 0.1% false positive rate).

> all the docs say “designed for in-memory lookups”

We use mmap for persistence as our use case is largely a build-once, read many times one. As a practical matter, the data structure involves lots of random access, so is better suited to in-memory use from a speed POV.

> fyi, you use temp::temp_file() but never actually use the result, instead using the hard-coded /tmp path

Thank you, have opened an issue and we'll fix it!

Sure, but I wouldn’t expect the api to force you to use an mmap when a slice of bytes would accomplish the same when unpersisted (and the user could choose to persist via a different mechanism if you have a .into() method that decays self into a Vec<u8>/Box<[u8]>/etc)

If I were to design this library, I would internally use an enum { Mapped(mmap), Direct(Box<[u8]>) } or better yet, delegate access and serialization/persistence to a trait so the type becomes BField<Impl> where the impl trait provides as_slice() and load()/save().

This way you abstract over the OS internals, provide a pure implementation for testing or no_std, and probably improve your codegen a bit.

Wonder if the error rate can be controlled?
Yes you can manage the error rate by controlling the overall size of the allocated bit array and several other parameters. There's a (slightly obtuse) section on parameter selection here: https://github.com/onecodex/rust-bfield?tab=readme-ov-file#p...

And a Jupyter notebook example here: https://github.com/onecodex/rust-bfield/blob/main/docs/noteb...

We do need a better "smart parameter selection" method for instantiating a B-field on-the-fly.