Hacker News new | ask | show | jobs
by rurban 2923 days ago
The best is currently Leonid Yuriev's t1ha. See https://github.com/rurban/smhasher/

Note that in contrast with what Andy or DJB say, the collision safety is not a problem of the hash function per se, as you cannot fix collision attacks with any "safer" hash function. You can easily brute-force even the worst of all siphash in under 4min. Safety begins with 256 bits, in a hash table you got typically 10-14, max 32 to attack. It is only making it marginably safer, but much slower. It is only doable by checking collision attacks in the collision scheme, either by collision counting or using a non-attackable collision scheme.

Unfortunately most implementations have no idea, and just go with the slowest of all. The latest such sin done in the linux network stack. I thought at least those guys were above that, but apparently not.

This simple multiplication hash is indeed superior, just with linear probing it has problems. You would use double hashing with another similar constant then. Or find two such constants dynamically, as this would be universal then.

And also note that the upper bits of a hash function are always superior to the lower bits. power-by-2 & checks are thus always worse than right-shifting as with this one.

2 comments

Thanks. Also, from smhasher text:

"See e.g. A Seven-Dimensional Analysis of Hashing Methods and its Implications on Query Processing

https://infosys.cs.uni-saarland.de/publications/p249-richter...

for a concise overview of the best hash table strategies, confirming that the simpliest Mult hashing (bernstein, FNV*, x17, sdbm) always beat "better" hash functions (Tabulation, Murmur, Farm, ...) when used in a hash table."

The conclusion is not properly stated, it's surely not "always" but there are enough use cases where the simple multiplications (or, even more probably, the "rotate, multiply, xor" and variants) are the fastest in practice.

From the paper:

"We could observe that Mult produces indeed more collisions than the expected amount on uniformly distributed keys. However, this larger amount of collisions does not get highly reflected in the observed performance. Thus, we consider Mult as the best candidate to be used in practice when quality results on high throughputs is desired, but at the cost of a high variance across data distributions."

It’s still a trade-off, but a good choice in some use cases.

I haven't found a usecase yet where a better hash function leads to a faster hash table. Theoretically this would happen with CPUs with extremely slow or small cache.

It's also confirmed with the recent trend to add slower hash functions everywhere, leading to dramatic performance regressions.

> You can easily brute-force even the worst of all siphash in under 4min

You assume that you know the seed or can directly observe the output, which is rare in practice.

> upper bits of a hash function are always superior to the lower bits

Why is that? The identity hash for ints is quite common. And for good hashes, the difference should be negligible.