Hacker News new | ask | show | jobs
by gopalv 2265 days ago
> Do you think Xor filter would be a better fit compared to Bloom here?

The 8byte key is the only scenario where you should consider XorPlus (i.e a 8 bytes mapped to a long).

The lookup properties of the Xor filter are better with that case, but the real question is whether you have an entire collection to start building the bitset or not.

The sketch production isn't incremental - there is no add(k) after building it once.

So you can't build add data once it is built, while the Bloom filters do support adding entries after the fact (in fact, it can add bloom filters into it, rather than sending all the new keys).

And both of those approaches are missing an unset operation.

2 comments

Sorry.. Forgot to mention that.

Yes insertion/deletion is required and the frequency may be higher. The exact usecase is that group membership is dependent on some conditions. We schedule this check every 15 minutes and based on it, we are adding/deleting the members.

Xor filters require all the members of the set be provided up front.

Bloom filters allow adding members, but not removing them.

Cuckoo filters allow removing members: https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf

And just for completeness, storing everything losslessly in a btree, radix tree, or hash table would probably be under 300 kB. (80 kB just for the IDs plus, say, an additional ~2x overhead.)

Safe removal is only possible if you know it is already in filter. The 'you can delete from cuckoo filters' bit I feel is often oversold.
Oh wow, I only read the paper's abstract and was fooled. Thanks for the correction.
You can use a counting Bloom Filter that allows deleting. Depending on your update requirements you might also be ok with two bloom filters, one for that has the group memberships and one that has group removals.
10k members every 15 minutes? Just rebuild it from scratch.
Or use a Cuckoo fiter.
> The sketch production isn't incremental - there is no add(k) after building it once.

Which means they can't be used for distributed set intersections, unions or cardinality estimations, a not insignificant side-benefit of bloom filters.