|
|
|
|
|
by eklitzke
1112 days ago
|
|
I think you're confusing proving that the hash function is collision resistant with the other goal which is hashing speed. If you really need a collision resistant hash you need to use a cryptographic hash function, but outside of cryptographic applications that is rarely the requirement. And (huge caveat, this isn't my domain expertise) I'm not sure what security properties are really "proven" about existing cryptographic hash functions, AFAIK existing cryptographic hash functions are considered secure because we don't know how to break them, not because of some fundamental mathematical property about them. For the other 99.999% of hashing applications there is a balance between collision resistance and hashing latency. For example, in a hash table (probably the most common use for a non-cryptographic hash function) there is a cost incurred by hash collisions because lookups on keys with collisions may have to do extra probing. On the other hand, every hash table lookup requires doing at least one hash operation, regardless of whether or not it collides. So it may make sense to have a slightly worse hash function (in the sense that it is more likely to have collisions with pathological inputs) if it has slightly lower latency. The only way to really know what is faster for a real world application is to have some kind of benchmark to train against as a loss function. |
|
I wish this misconception would die. There is a great theory of algorithmic probabilistic hash functions, completely distinct from cryptographic hash functions. If you are designing a hash table, or a different algorithm using a hash function, you nearly always want the former kind.
The idea is that `Pr[h(x) = h(y)]` is small _no matter the inputs x and y_. Here the probability is over the random seed of h. Lots of good hash functions, like UMASH (https://engineering.backtrace.io/2020-08-24-umash-fast-enoug...) has this guarantee. Other fast hash functions, like MURMUR don't.
When a function doesn't have this guarantee, it means I can find sets of values x1, x2, ... that will likely collide under _any_ or most seeds! Sure, if your inputs are basically random, this probably won't happen, but people can still use this to DDoS your hash table, or whatever you are coding.
Notice again, this has nothing to do with cryptography. It is all about probabilistic guarantees. You can't just test the hash function on a fixed number of inputs and say it's good, since you may just have moved the "bad set" to somewhere else.
In this day and age there are super fast algorithmic hash functions with guaranteed low expected collisions. It's just silly to use one that you can break so easily.