|
|
|
|
|
by tzs
3267 days ago
|
|
For implementations that use cryptographic hash functions, do you actually need more than one hash function invocation per item? For instance, suppose you were implementing a Bloom filter with 2^20 bits, and you want to use 10 bits per item. Instead of hashing the item 10 times, could you hash once with a 256 bit cryptographic hash, and then take the lower 200 bits, divide that into 10 bit strings of 20 bits each, and use those? |
|
To use your example, let's say that the Bloom filter has a size of 2²⁰ bits (128 KB) and that we are using 10 bits per item. In other words, for each new item we need to calculate ten positions in the range [0,2²⁰).
We start by using a cryptographic hash function on the item just once. For example, SHA-256, which will give us a 256-bit value.
Now we need to extract ten blocks from these 256 bits. We can do as you suggest and make each block 20 bits long, but I don't know of a reason why we should not make them a bit longer. A length of 32 bits would be nice, so each block could be neatly mapped to a long.
To get ten blocks of 32 bits out of a 256-bit value, we just calculate a step such that block i starts at the position i * step. In our example, this means that the first block is in positions [0,31], the second is in [22,54]… and the tenth one is in [220,252].
At this point, we have ten blocks of 32 bits and we need ten values in the range [0,2²⁰). So we just turn each block into a long and calculate modulo 2²⁰ for each of them.
And that's pretty much it.
To enter the item in the filter, we set to true the positions corresponding to each of those 10 values.
To check whether an item may be in the filter, we do the same procedure and check that all of the ten positions have been set to true. If any of them is false, the item is not in the filter.