If you don't care about forged collisions, what makes a secure-ish checksum bette than a faster insecure checksum? Both are likely to result in a different value for bit-flipped input, which is the point.
Note that you said “likely” and not “equally likely”. Therein lies your answer.
There are applications where hashes are used as identifiers, where it would be impossible to use the original data to resolve possible collisions. One example would be in RTTI (Run Time Type Information), when you want to check if two objects (of unknown-at-compile-time type) are instances of the same type. The only performant way to achieve this is if the compiler emits each type’s RTTI with a unique integer key to efficiently compare against, and this key must be unique across all object files and shared objects. In practice, this is done using hashing. If there is a collision, then the program behavior is undefined, so it is ideal to minimize probability of collision. You can see this issue in the Rust compiler project where there is ongoing discussion about how to handle this hashing, trying to balance compiler speed and collision likeliness:
Another scenario would be like what Backtrace does: a hash is used to identify certain classes of errors in application code. The cost of analyzing each error is quite high (you have to load objects into memory, source maps (if available), perform symbolification, etc), so comparing any two errors every time you want to perform any metrics or retrieve associated data (like issue tracker identifiers) would reduce the performance to a point that any such application would be too slow to be useful. So a lot of thought was given to producing fast, non-cryptographic hashes (these hashes aren’t shared across tenants, and no customer is going to go out of their way to work out initialization parameters so they can be their own adversary) with a known likely hood of collision. You can read about the hash (UMASH) here:
Well, also notice that I said "bit-flipped input." I.e., the use case I believed to be talking about and responding to was data corruption detection (which is also associated with "checksum" as a term of art, as opposed to a generic "hash"). I agree that hashing for identity is a different use case than checksumming for data integrity.
FWIW I think commonly used checksums are basically ideal ("equally likely" to come to a different value) on single-bit flipped inputs, but don't quote me on that.
That’s a good question. A cryptographic hash must not leak any information other than “yes” or “no”. A checksum-type hash might have an additional property where data that only has a single bit flipped might hash to something close to the original hash, allowing you to refine your collision attack by knowing if you’re hot or cold.
For a checksum, this isnt a bad thing and actually could be good.
There are applications where hashes are used as identifiers, where it would be impossible to use the original data to resolve possible collisions. One example would be in RTTI (Run Time Type Information), when you want to check if two objects (of unknown-at-compile-time type) are instances of the same type. The only performant way to achieve this is if the compiler emits each type’s RTTI with a unique integer key to efficiently compare against, and this key must be unique across all object files and shared objects. In practice, this is done using hashing. If there is a collision, then the program behavior is undefined, so it is ideal to minimize probability of collision. You can see this issue in the Rust compiler project where there is ongoing discussion about how to handle this hashing, trying to balance compiler speed and collision likeliness:
https://github.com/rust-lang/rust/issues/129014
Another scenario would be like what Backtrace does: a hash is used to identify certain classes of errors in application code. The cost of analyzing each error is quite high (you have to load objects into memory, source maps (if available), perform symbolification, etc), so comparing any two errors every time you want to perform any metrics or retrieve associated data (like issue tracker identifiers) would reduce the performance to a point that any such application would be too slow to be useful. So a lot of thought was given to producing fast, non-cryptographic hashes (these hashes aren’t shared across tenants, and no customer is going to go out of their way to work out initialization parameters so they can be their own adversary) with a known likely hood of collision. You can read about the hash (UMASH) here:
https://engineering.backtrace.io/2020-08-24-umash-fast-enoug...