Hacker News new | ask | show | jobs
by throwawayReply 3002 days ago
I was trying out the javascript example, but managed to get a case where there is a negative conclusion but this goes against the introduction where it says negative conclusions are always definite.

My multiset was:

   a, b, c, d, e, f, g, h, j, abc, def, asd, asds, 3g46stb6vy6vsyosyvosfsdfsdsah, oooooooooooooooo
Then trying oooooooooooooooo a second time, the bloom filter correctly says it might be in the set. The cuckoo filter says it is not.

I assume this is just a javascript bug in implementation, because previously the filter said it might be in the set, actually adding it to the set then gave a false negative when trying it again.

1 comments

Yeah - pardon the crappy javascript and UX. Notice the colors on the cuckoo filter when you insert this sequence. When you try to insert `3g46stb6vy6vsyosyvosfsdfsdsah`, the fingerprint entries in the cuckoo turn red indicating an unsuccessful entry. All possible fingerprint slots in the cuckoo for that entry were already occupied, and the recursive rearranging of entries was unable to free up a slot.

This is a down-side to cuckoos (and counting blooms) - even when the filter has reasonable capacity remaining, an entry can be denied due to collision with previous entries and saturation of their slots.

The UX on this demo could be better - e.g. not insert into the bloom unless the cuckoo insert is successful (to keep them from diverging). Also, a message indicating an unsuccessful insertion would be nice.

I wish this was something that people brought up more prominently when discussing cuckoo filters. They fail catastrophically. A bloom filter gradually fails as the number of items inserted increases and the false positive rate tends to one. But a cuckoo filter that fills up the buckets for a fingerprint has to reject an insert meaning that either you start to have false negatives or to guarantee that you don't have false negatives you have to immediately return true for any query, which is a false positive rate of exactly one with a sharp discontinuity in behavior from before the failed insert. I think the salesmanship of cuckoo filters has been a little overdone and in most applications a bloom filter is a better choice.
Yes, cuckoo filters and counting bloom filters can start rejecting insertions before the filter is loaded - cuckoos because of the inability to find an open slot, and counting blooms because of saturation of a counter. Note that this is only the case if deletion or counting support are required.

In some cases, like the when there's a high likelihood of duplicate entries, this can represent a catastrophic failure mode. This is a good point, and when I get around to updating the tutorial, I'll make note of this.

Regarding the failure mode, an insertion can be knowingly rejected without mutating the filter, so the filter remains useful for subsequent insertions and all reads, retaining all its guarantees. The only loss is the rejection of the item - in our case we used this as signal to rebuild the filter at a larger size.

Why the throaway account, though?