Hacker News new | ask | show | jobs
by mgunlogson 3494 days ago
The thing with Cuckoo filters is that the item's alternate position is determined solely from it's current position and fingerprint.

Mathematically this is done by XOR'ing with a magic mixing constant borrowed from Murmur2 :) (or in my lib from Mumur3). The quality of the randomness of alternate bucket positions is...probably bad, but in practice it seems to work fine.

Because of this finding alternate positions doesn't require a rehash per se, just a few CPU instructions.

However, there IS a corner case that requires re-hash that comes up more frequently the shorter the fingerprints used. Conceptually a tag/fingerprint of zero means empty bucket, and occasionally the fingerprint hash for an item will end up being zero. In this case I've seen some libraries just add 1, which I didn't really like since it increases hash collisions slightly. Or you can do it like my lib does and just re-hash the item with a salt.

1 comments

Hm, the empty/zero issue is quite interesting. How are buckets usually represented? I mean, if it were just an array of numbers, that problem wouldn't exist. If it's e.g. multiple bit sequences "bit-or-ed" into one integer, couldn't we use one additional bit per fingerprint to indicate whether it's there?

    3-bit fingerprints + sign, v: "signs"

    v      |v        |v      |v
    1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0
Depends on the language I guess. The most efficient way to represent buckets is to pack the buckets right next to eachother in a giant memory block made of bits. The other part of your question about using a flag, your method definitely works but adds an extra bit when you only need one value for zero.

Using bit flags: Using zero as empty, 4 bits could represent numbers 0-15, so 15 values plus empty/zero. With 5 bits you can represent 0-31, so 30 values plus empty/zero. If you're always using one of your bits for the flag it reduces the uniqueness of a fingerprint by almost 50%.

How the filter is built in memory: At least in Java, the best way to build the backing data structure is a giant array of longs. Really what you want is just a giant block of memory you can address by bit offset. Unfortunately modern CPU's can only index into a byte offset. Because of this, buckets and fingerprints can overlap byte boundaries. To make this happen as little as possible, it's best to go the other way and index into the largest primitive structure the CPU supports. In Java at least, this means you want to simulate bit-addressable memory using a long array.

My java filter uses a Bitset implementation borrowed from Apache Lucene because the Java builtin Bitset only allows a 32 bit index(same with java array indices). You would think this limits you to a 2GB filter but it's actually only around 270MB since the Bitset indexes are a 32 bit int. Anyways, the Lucene Bitset uses a long array to simulate bit addressable memory.

Using regular arrays of booleans doesn't work in higher level languages unfortunately. They tend to use CPU supported primitives underneath so for example a boolean could take up a byte underneath. Every Cuckoo filter library I know of besides the reference implementation and mine doesn't handle this properly and is extremely space-inefficient :(.