Hacker News new | ask | show | jobs
by tzs 3267 days ago
For implementations that use cryptographic hash functions, do you actually need more than one hash function invocation per item?

For instance, suppose you were implementing a Bloom filter with 2^20 bits, and you want to use 10 bits per item. Instead of hashing the item 10 times, could you hash once with a 256 bit cryptographic hash, and then take the lower 200 bits, divide that into 10 bit strings of 20 bits each, and use those?

3 comments

Yes, you can do that. Let me describe something similar that I implemented recently.

To use your example, let's say that the Bloom filter has a size of 2²⁰ bits (128 KB) and that we are using 10 bits per item. In other words, for each new item we need to calculate ten positions in the range [0,2²⁰).

We start by using a cryptographic hash function on the item just once. For example, SHA-256, which will give us a 256-bit value.

Now we need to extract ten blocks from these 256 bits. We can do as you suggest and make each block 20 bits long, but I don't know of a reason why we should not make them a bit longer. A length of 32 bits would be nice, so each block could be neatly mapped to a long.

To get ten blocks of 32 bits out of a 256-bit value, we just calculate a step such that block i starts at the position i * step. In our example, this means that the first block is in positions [0,31], the second is in [22,54]… and the tenth one is in [220,252].

At this point, we have ten blocks of 32 bits and we need ten values in the range [0,2²⁰). So we just turn each block into a long and calculate modulo 2²⁰ for each of them.

And that's pretty much it.

To enter the item in the filter, we set to true the positions corresponding to each of those 10 values.

To check whether an item may be in the filter, we do the same procedure and check that all of the ten positions have been set to true. If any of them is false, the item is not in the filter.

What is the tradeoff? Cryptographically secure hashes usually require dozens of rounds and take a lot of time. Could two less secure hashes could do the same thing in less time? What about when the crypto hash is hardware accelerated?

I haven't run the numbers, but I'd bet that two easy hashes still beat the hardware accelerated one by quite a lot (taking time is part of what makes cryptographic hashes secure).

I remember I played with this specific scheme a while ago, and from what I remember it was doing the work well (I did not run a comparison though).