| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mgunlogson 3488 days ago

Depends on the language I guess. The most efficient way to represent buckets is to pack the buckets right next to eachother in a giant memory block made of bits. The other part of your question about using a flag, your method definitely works but adds an extra bit when you only need one value for zero.

Using bit flags: Using zero as empty, 4 bits could represent numbers 0-15, so 15 values plus empty/zero. With 5 bits you can represent 0-31, so 30 values plus empty/zero. If you're always using one of your bits for the flag it reduces the uniqueness of a fingerprint by almost 50%.

How the filter is built in memory: At least in Java, the best way to build the backing data structure is a giant array of longs. Really what you want is just a giant block of memory you can address by bit offset. Unfortunately modern CPU's can only index into a byte offset. Because of this, buckets and fingerprints can overlap byte boundaries. To make this happen as little as possible, it's best to go the other way and index into the largest primitive structure the CPU supports. In Java at least, this means you want to simulate bit-addressable memory using a long array.

My java filter uses a Bitset implementation borrowed from Apache Lucene because the Java builtin Bitset only allows a 32 bit index(same with java array indices). You would think this limits you to a 2GB filter but it's actually only around 270MB since the Bitset indexes are a 32 bit int. Anyways, the Lucene Bitset uses a long array to simulate bit addressable memory.

Using regular arrays of booleans doesn't work in higher level languages unfortunately. They tend to use CPU supported primitives underneath so for example a boolean could take up a byte underneath. Every Cuckoo filter library I know of besides the reference implementation and mine doesn't handle this properly and is extremely space-inefficient :(.