| HN Mirror

Note that you said “likely” and not “equally likely”. Therein lies your answer.

There are applications where hashes are used as identifiers, where it would be impossible to use the original data to resolve possible collisions. One example would be in RTTI (Run Time Type Information), when you want to check if two objects (of unknown-at-compile-time type) are instances of the same type. The only performant way to achieve this is if the compiler emits each type’s RTTI with a unique integer key to efficiently compare against, and this key must be unique across all object files and shared objects. In practice, this is done using hashing. If there is a collision, then the program behavior is undefined, so it is ideal to minimize probability of collision. You can see this issue in the Rust compiler project where there is ongoing discussion about how to handle this hashing, trying to balance compiler speed and collision likeliness:

https://github.com/rust-lang/rust/issues/129014

Another scenario would be like what Backtrace does: a hash is used to identify certain classes of errors in application code. The cost of analyzing each error is quite high (you have to load objects into memory, source maps (if available), perform symbolification, etc), so comparing any two errors every time you want to perform any metrics or retrieve associated data (like issue tracker identifiers) would reduce the performance to a point that any such application would be too slow to be useful. So a lot of thought was given to producing fast, non-cryptographic hashes (these hashes aren’t shared across tenants, and no customer is going to go out of their way to work out initialization parameters so they can be their own adversary) with a known likely hood of collision. You can read about the hash (UMASH) here:

https://engineering.backtrace.io/2020-08-24-umash-fast-enoug...