| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mgunlogson 3494 days ago

Hmmm.... this might be an artifact of the python implementation?

The insert speed should be roughly constant until the filter begins moving items into their alternate buckets(when one of the two buckets for that item is full). In my implementation this doesn't usually happen until well over 90% capacity.

Insert speed will slow slightly as the filters fills up if the implementation uses a shortcut to determine the first empty bucket position by just checking if it's zero. Conceptually, if a bucket is empty you only need to check the first position before the insert. When the filter is half full you need to check ~%50 of bucket positions before an insert. In practice the bucket operations are so fast and buckets only hold 4 items each, I haven't noticed a difference.

1 comments

ScottBurson 3494 days ago

Yes, I understand why it slows down (though the additional details you've provided are helpful). My point was that slowing down is correctly described as a decrease, not an increase, in throughput.

Since you've implemented one, maybe you can answer a couple more questions I have:

> Typically, a Cuckoo filter is identified by its fingerprint and bucket size. For example, a (2,4) Cuckoo filter stores 2 bit length fingerprints and each bucket in the Cuckoo hash table can store up to 4 fingerprints.

Wouldn't that fingerprint length normally be two bytes rather than two bits? The Python sample code given appears to use bytes.

Also, when you start doing deletion, isn't there a nonzero probability of causing false negatives because of fingerprint collisions? I guess a 2 byte fingerprint would make that probability acceptably small, but it doesn't seem to be zero. Correct?

mgunlogson 3494 days ago

The code should use bits, the point of the filter is to increase space efficiency so you need to pack everything together as close as possible. Having fingerprint sizes of only multiples of 8 bits doesn't make sense.

Also, from the quote, it's probably more accurate to say a Cuckoo filter is identified by it's fingerprint size, bucket size, and capacity. In most implementations the bucket size is fixed at 4, so you can define a filter with only fingerprint size and capacity.

And yes, deleting can cause false negatives but only in a roundabout way. Deleting will only cause false negatives if you delete something that wasn't added previously. Because the filter has false positives, sometimes it will tell you it's seen an item that it hasn't. If you delete one of these false positives it will cause a false negative. If you only delete items that have been added before, and in the case of duplicate items, added the same number of times, you won't get any false negatives.

Essentially if you saved a log of all the inserts then replayed it as deletions you would have a perfectly empty filter every time with no false negatives. I actually have a unit test that does this in my java Cuckoo filter :), it exorcised quite a few demons from the code.

If you're curious the unit test is sanityFillDeleteAllAndCheckABunchOfStuff() in https://github.com/MGunlogson/CuckooFilter4J/blob/master/src...