Hacker News new | ask | show | jobs
by natch 3267 days ago
I disagree with the standard dogma around bloom filters that you need multiple hash functions. Just use a simple incrementing salt value to modify the input so you can hash the resulting salted input as many times as you need to, using a different salt value each time.

Say you want to hash the string "abc" 8 times. Instead of having 8 hash functions, just take the hash of, say, "abc-0", "abc-1", ... "abc-7". As long as your hash function is of good quality and you're treating the output properly, you don't have to worry about the results from these different inputs having some significant relationship. For cryptographic types maybe I'm using the term "salt" loosely. Whatever, think of another term if you like. Anyway in practice this has worked fine for me.

5 comments

See "Less Hashing, Same Performance: Building a Better Bloom Filter": https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.p.... This is pretty standard practice now.

There've been a couple of other analyses (at least) of the importance of independence among hash functions in probabilistic data structures --- which if I remember correctly was "not as much as you'd think" --- but I don't remember the titles of the papers offhand.

Standard practice? Someone should tell all the fresh CS grads who keep blogging about their multiple hash algorithms. I've been doing this as standard practice since the 1990s. I'm glad if Computer Science academics are catching up to the working practitioner.
> catching up

Not quite. The n-hashing methods described in that paper are rather different from yours. They're significantly faster, for one thing, which matters if you're building e.g. high-performance networking products.

An equally important contribution of the paper though is providing a framework for proving the reliability of such methods. There are areas of software development where "in practice this has worked fine for me" isn't good enough.

I see, I misunderstood your first comment. I took it as you posting a paper about the technique I described, and when you said "This is pretty standard practice now." I thought you meant "this, what natch just said in his comment," as opposed to "this, what the paper said."

So they're still doing multiple hash functions as standard practice, you're saying? Wow.

Yes their methods are rather different I see.

Indeed using multiple salts is slower than two hashes.

What I do now instead of that (the salts method was easier to explain) is just one hash, currently SHA-1 but may move to BLAKE2 soon, and chop the result of that up into 20-bit pieces that I use as hash values, modulo the bit array size. How do you think this would compare to what they do?

> What I do now

Same here (and if you're interested in trying out some new hash functions for performance-related reasons, I'll recommend fast non-cryptographic options like xxHash and MurmurHash). You can still use your chopped-up hashes to generate additional values if needed, depending on the desired error rate.

Do watch out for modulo bias when mapping the chopped-up hashes to the array, though. For max hash value >>> array size the bias is going to be vanishingly small, but if you're clipping the hashes to the nearest power of two greater than the array size, things could get ugly.

That's just an easy way to create new hash functions out of an existing one. Nobody sensible would insist that your hashes must have different underlying algorithms, just that they produce uncorrelated outputs.
You haven't been reading the blog posts on bloom filters. Or maybe you're saying they're not sensible, in which case I agree.
Haven't been reading is definitely true. Not sensible sounds like the case.

If you're using a cryptographic hash, your method clearly works. If you're using a fast and crappy hash function, you might need more care. But I suspect not.

As always, test with your real-world data. If it does what you need it to, theoretical qualms are likely irrelevant.

> I disagree with the standard dogma around bloom filters that you need multiple hash functions.

There is no dogma there, just a confusion of terminology. Your "abc-0", "abc-1", ... "abc-7" could be considered a family of hash functions, with the suffix as a parameter. "Use this algorithm on the input + a suffix of '-0'" is a different hash function than "Use this algorithm on the input + a suffix of '-1'"

Similarly, splitting a long or infinite length output of a hash function into smaller pieces is a perfectly acceptable way of constructing multiple independent hash functions.

When a paper talks about "multiple hash functions", that means exactly what it says, but no one is saying that you need multiple hash algorithms. It might mean making multiple hash functions using universal hashing, or through techniques like the "Less Hashing, Same Performance" paper people have linked to, or just using a prefix/suffix on the input like you suggest. No one is using CRC32 and MD5 and SHA1 together as their multiple hash functions, that's silly, unnecessary, complicated and doesn't scale.

The only problem is this assumes that the hash unpredictably changes on input modification. Which is true for cryptographic hashes but not others.
It assumes nothing of the sort. [Edit: well I mean it RELIES on it yes but it's not an assumption; it's verified by knowing the properties of the hash function used.] But it perhaps hinges on your definition of "unpredictably."

If you're taking the modulus of a large integer with respect to a very much smaller bit array with a length that is a prime number, there is plenty of unpredictability with most decent hash functions (note I did have "good quality" as a caveat above) even non-cryptographic ones. That being said, I do use cryptographic hashes.

basically, ideally you would want all of your hash functions to belong to the same universal hash family[1].

[1]: https://en.wikipedia.org/wiki/Universal_hashing

I think I actually wrote a section on doing this and then wrote over it at some point :(

PRs welcome, I'm on vacation and unlikely to get to it immediately. http://github.com/llimllib/bloomfilter-tutorial