Hacker News new | ask | show | jobs
by devj 2265 days ago
I have a requirement of checking group membership wherein a group may contain > 10K members. Each member is an 8byte ID. Do you think Xor filter would be a better fit compared to Bloom here? Or am I looking at it the wrong way? Thanks.
8 comments

> Do you think Xor filter would be a better fit compared to Bloom here?

The 8byte key is the only scenario where you should consider XorPlus (i.e a 8 bytes mapped to a long).

The lookup properties of the Xor filter are better with that case, but the real question is whether you have an entire collection to start building the bitset or not.

The sketch production isn't incremental - there is no add(k) after building it once.

So you can't build add data once it is built, while the Bloom filters do support adding entries after the fact (in fact, it can add bloom filters into it, rather than sending all the new keys).

And both of those approaches are missing an unset operation.

Sorry.. Forgot to mention that.

Yes insertion/deletion is required and the frequency may be higher. The exact usecase is that group membership is dependent on some conditions. We schedule this check every 15 minutes and based on it, we are adding/deleting the members.

Xor filters require all the members of the set be provided up front.

Bloom filters allow adding members, but not removing them.

Cuckoo filters allow removing members: https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf

And just for completeness, storing everything losslessly in a btree, radix tree, or hash table would probably be under 300 kB. (80 kB just for the IDs plus, say, an additional ~2x overhead.)

Safe removal is only possible if you know it is already in filter. The 'you can delete from cuckoo filters' bit I feel is often oversold.
Oh wow, I only read the paper's abstract and was fooled. Thanks for the correction.
You can use a counting Bloom Filter that allows deleting. Depending on your update requirements you might also be ok with two bloom filters, one for that has the group memberships and one that has group removals.
10k members every 15 minutes? Just rebuild it from scratch.
Or use a Cuckoo fiter.
> The sketch production isn't incremental - there is no add(k) after building it once.

Which means they can't be used for distributed set intersections, unions or cardinality estimations, a not insignificant side-benefit of bloom filters.

Might as well use a simple hashmap if that's all you have. Unless you have some extreme performance requirements.
This has much better storage/memory bounds than a hash table. The end result is less cache misses.
Sure, but 10K items is so little that you might as well not bother. If you're worried about memory usage you can also just scan the whole table each time (no such thing as a cache miss when the whole table fits in cache).
This is better than a hash table for membership testing. As for compared to table scan depends on your key size and number of keys.

This algorithm means you could easily have more than 10k especially for larger keys, and still likely reside in one of the cache levels. Heck a lot CPUs cant hold 10k items in L1. Like for 64 bit numbers thats 80Kb. Let alone for something like UUIDs or other larger sparse keys.

Also main memory access is aprox. delay of a few thousand cycles. Iterating a table of 10,000 items is going to be close to that. A hit in L1 is about 3 cycles. So a table scan is possibly slower or close to equal since a super scalar cpu has multiple execution ports.

Soon as you start getting any larger its easily gonna be slower to do a table scan.

This also gives you O(1) queries and being smaller means that table is more likely to reside in cache for any given look up. This smaller memory footprint is a great boon for any hash algorithm that involves look up.

All excellent points but no matter how impressive the savings it's not going to be interesting if the total time spent checking for membership is low to begin with. You'd need to be optimizing a very tight loop to even notice the difference between brute force and something more sophisticated like a hash map. And you'd need to be pretty damn desperate for more performance to even consider implementing a algorithm described in a paper in the hopes of beating the standard solutions.

It's pretty good fun though, so don't let pragmatism stop you. Just beware of premature optimization and all that.

A hash map is better than a hash table? I don't understand, genuine question, aren't they the same thing or am I getting confused about what you're saying?
Hash map and hash table are the same thing here. That’s what the GP suggests using. (And it’s the right suggestion.)

Bloom filters, cuckoo filters and XOR filters are all types of probabilistic lookup, which are different from normal hash tables.

That’s a very small dataset. Just use a normal HashSet / dict / whatever is normal in the language you’re using.
How does plain binary search perform? 80 KB is not a lot to cache, if you do many lookups misses should go down, right?
Try hyperloglog (HLL) which is another terrific and widely-available algorithm for set membership.

All these algorithms are statistical, and should be checked by hand in the second phase, e.g. not-member is reliable but is-member should be confirmed.

Do you mean like 10,000 valid members or 10,000 candidates (where the set might only contain a small number of valid members?)
Is performance an issue? If not, occam says go with the simplest which is bloom. If performance matters and you are prepared to burn a bit more mem for speed, blocked bloom filters, which are likely optimal for speed and still simple.
Are the ids sparsely distributed?