|
|
|
|
|
by didericis
1620 days ago
|
|
Theoretically couldn’t there be a hashing algorithm that’s one to one if it always spits out a hash as long or longer than the input message? I’ve never actually walked through the math behind hashing algorithms, but I’m assuming collisions come from truncation. I’m guessing you’re usually not able to know exactly where two inputs that collide for the first n bits end up diverging, so the only way to ensure most hash functions are one to one is if the outputs have infinite length. But, if you had outputs of infinite length for different inputs, eventually they’d have to diverge. Idk if that’s true of all hashing functions/maybe there’s a way to know after what point outputs for different inputs have to diverge for some. |
|
For older designs like MD5, SHA-1, and SHA-256, the final hash is literally that state, just serialized into bytes and returned to the caller. (This is what makes "length extension attacks" possible on these hashes, which is why we need constructions like HMAC.) For newer designs like SHA-3 and the BLAKE family, the output is some function of the state, which prevents length extension attacks. This also makes it easy for these functions to offer "extendable output" features, i.e. as many output bytes as you like. (SHA-3 isn't standardized with this feature, but the very closely related SHAKE functions will gladly give you outputs of any length.)
However, one important thing to realize about these functions is that extended outputs do not increase security. This is counterintuitive, because we're used to distinctions like SHA-256 vs SHA-512, with the larger output providing more security in some sense. That's true, but it requires SHA-512 to keep a larger state in addition to producing a larger output. SHAKE128 and BLAKE3 always use the same state size, regardless of how many output bytes you ask for, and if you produce a collision in that state, all the output bytes will collide too.
Another commenter mentioned perfect hash functions, and my understanding of those is that they typically require the input set to be of some fixed size. If the input set is "any possible string", which it pretty much is for cryptographic hashes, I think trying to design a perfect hash function starts to get weird? At the very least, the state you need to keep will be proportional to the longest message you want to hash.