Hacker News new | ask | show | jobs
by Nilithus 1376 days ago
I was wondering anyone could shed some light on the "These hash functions should be pairwise independent" part. I don't know what pairwise independent means. My background is mostly in web so I am familiar with things like `sha256`, `sha1`, `md5` things like that. Would you use these kind of functions for Count-min Sketch?

I saw in your socklimit project looks like `fasthash64` and `hashlittle` I'm not familiar with those any insight or recommended reading to understand these hash functions?

p.s. googling pairwise independent hash functions did get me some college class reading but doesn't mention any named hash functions developed out in the world.

3 comments

Those hash functions are (or were) cryptographically secure. They aim to prevent many different kinds of attacks that may be launched on hashed data. There are many other noncryptographic ways to use hashes (like writing your own dictionary) that do not need the protections cryptographic hashes give you and therefore can trade those protections for performance.

Pairwise independence basically means that applying 2 different hash functions to the same key produces 2 distinct/seemingly random values. There's a much more precise mathematical definition but that's the essence.

You probably don't need the hash to be cryptographically secure. The simplest solution might be to use several uniquely keyed siphash instances, as siphash is quite cheap to compute, was designed for use in hash tables, and two siphashes with different keys behave pairwise independent, which roughly means there are no observable correlations between them.

[1] https://en.wikipedia.org/wiki/SipHash

I think we tried siphash at some point, unfortunately it's significantly slower than the two hash functions we ended up using.
We ended up with fasthash64 and lookup3 by looking for a fast hash that is easy to port to the restricted subset of C supported by eBPF with minimal changes. https://github.com/rurban/smhasher is a great resource for that.

I would probably choose different, more robust hash functions if I was targeting regular C.