Hacker News new | ask | show | jobs
by mgunlogson 3484 days ago
Hi there, I've wrestled with Cuckoo filters myself, so I took a look at your lib and there's a some things that stood out.

Fingerprint size- It allows fingerprints that are too short, basically less than 4 bits doesn't allow a reasonable fill capacity. The paper authors only hinted at this, but check out the fill capacity graphs on page 6. This could be why your inserts are slowing down around 80% level when in my experience it doesn't happen till around 93%.

Modulo bias - Your filter capacity code doesn't seem to round the filter size to a power of two. This is a simple fix, but without it your array indexes will be skewed by modulo bias, possibly badly if someone picks a nasty number.

Alternate bucket positions - Your code seems to do a full hash for calculating alternate bucket positions. I know the paper mentions this, but I haven't seen anyone actually doing it :). Its a lot faster to just XOR with a mixing constant. TBH that's what most libraries are doing... whether it's a good idea is debatable :).

Fingerprint zero corner case - I'm not that great at python, but I didn't see any special handling for the rare case that the hash fingerprint is zero. In most implementations zero means empty bucket, so this could violate the "no false negatives" aspect of the filter by making items rarely disappear when they were supposed to be inserted. Most implementations either just add one to it, but I prefer to re-hash with a salt.

No victim cache - Didn't look too much into it, but I didn't see a victim slot used in your code. This will cause problems when the first insert fails. The problem is, by the time the first insert actually fails, you've relocated a bunch of different fingerprints like 500 times. It becomes unclear which fingerprint you originally tried to insert, and you're left holding a dangling item from a random place in the filter that you cannot insert. This violates the "no false negatives" mantra. Even though the filter is full it shouldn't break by deleting a random item when the first insert fails. You either need to store this last item or figure out a way to unwind your insert attempts to be able to reject the item that originally failed to insert.

Check out my Java library if you want to see how I dealt with these things. Also I have a bunch of unit tests there that I either came up with or borrowed from other Cuckoo libs. Should be pretty easy to convert some of those to python :) .

https://github.com/MGunlogson/CuckooFilter4J

Cheers, Mark

1 comments

Hi Mark,

One of the authors here. Great points especially on the "No victim cache" aka insertion failure. This was mostly a test implementation for us to see how cuckoo filters work, but agree that it is critical to have these issues fixed.

I'll check out your library for improvements.

thanks!

No problem, The paper authors left a lot of the specifics off... I got burned by the victim cache too. I saw one in their reference code and thought it was pointless until weeks later when I got around to writing tests and had one of those aha moments.