| HN Mirror



	by didericis 1620 days ago
	Theoretically couldn’t there be a hashing algorithm that’s one to one if it always spits out a hash as long or longer than the input message? I’ve never actually walked through the math behind hashing algorithms, but I’m assuming collisions come from truncation. I’m guessing you’re usually not able to know exactly where two inputs that collide for the first n bits end up diverging, so the only way to ensure most hash functions are one to one is if the outputs have infinite length. But, if you had outputs of infinite length for different inputs, eventually they’d have to diverge. Idk if that’s true of all hashing functions/maybe there’s a way to know after what point outputs for different inputs have to diverge for some.

4 comments

oconnor663 1620 days ago

Most cryptographic hash functions in practice mix their input block-by-block into some "state" that's of a fixed size. This lets you implement them with a small, constant memory footprint, which is important.
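A minimal sketch of the block-by-block idea, using Python's standard `hashlib`. The helper name `sha256_streaming` and the 4096-byte chunk size are illustrative assumptions, not anything from the thread. It feeds the message in pieces and checks that the digest matches hashing it in one shot.

```python
import hashlib

def sha256_streaming(chunks):
    """Hash an iterable of byte chunks, feeding them in one at a time."""
    h = hashlib.sha256()
    for chunk in chunks:
        # Each update mixes more input into the fixed-size internal state.
        h.update(chunk)
    return h.hexdigest()

message = b"a" * 1_000_000
# Hypothetical chunk size; any split of the message gives the same digest.
chunks = (message[i:i + 4096] for i in range(0, len(message), 4096))

streamed = sha256_streaming(chunks)
one_shot = hashlib.sha256(message).hexdigest()
print(streamed == one_shot)
```

The hasher object only has to carry its internal state between chunks, which is why memory use does not grow with the message length.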

For older designs like MD5, SHA-1, and SHA-256, the final hash is literally that state, just serialized into bytes and returned to the caller. (This is what makes "length extension attacks" possible on these hashes, which is why we need constructions like HMAC.) For newer designs like SHA-3 and the BLAKE family, the output is some function of the state, which prevents length extension attacks. This also makes it easy for these functions to offer "extendable output" features, i.e. as many output bytes as you like. (SHA-3 isn't standardized with this feature, but the very closely related SHAKE functions will gladly give you outputs of any length.)

However, one important thing to realize about these functions is that extended outputs do not increase security. This is counterintuitive, because we're used to distinctions like SHA-256 vs SHA-512, with the larger output providing more security in some sense. That's true, but it requires SHA-512 to keep a larger state in addition to producing a larger output. SHAKE128 and BLAKE3 always use the same state size, regardless of how many output bytes you ask for, and if you produce a collision in that state, all the output bytes will collide too.
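One way to see this, sketched with the SHAKE128 implementation in Python's `hashlib` (the message and output lengths below are arbitrary choices for illustration): asking for more output bytes just continues the same output stream, so a longer output starts with the shorter one.

```python
import hashlib

msg = b"hello"

# Request 16 bytes and 64 bytes of SHAKE128 output for the same message.
short_out = hashlib.shake_128(msg).digest(16)
long_out = hashlib.shake_128(msg).digest(64)

# The longer output begins with the shorter one: both are read from the
# same stream, which is derived from the same fixed-size state.
print(long_out.startswith(short_out))
```

Since every output byte is derived from that one state, two messages that collide in the state would agree on all output bytes, however many are requested.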

Another commenter mentioned perfect hash functions, and my understanding of those is that they typically require the input set to be of some fixed size. If the input set is "any possible string", which it pretty much is for cryptographic hashes, I think trying to design a perfect hash function starts to get weird? At the very least, the state you need to keep will be proportional to the longest message you want to hash.

garaetjjte 1619 days ago

>This is what makes "length extension attacks" possible on these hashes, which is why we need constructions like HMAC

Or just truncate the hash so you don't get full state, like SHA-512/256.

oconnor663 1619 days ago

Yes, SHA-512/256 is often a good option, despite its tragically confusing name. But (maybe going off on my own tangent here) it's got a few important shortcomings:

1. It's less widely supported than SHA-256 or SHA-512. For example, it's present in OpenSSL but not in Python's hashlib. A reasonable workaround here is to just truncate SHA-512 yourself, which gives you a functionally similar but not-output-compatible hash. But no one loves monkeying around with standards like this.

2. SHA-512 doesn't have any hardware support that I'm aware of. SHA-256 has dedicated hardware acceleration on lots of ARM chips and recent x86 chips, which is very nice. But SHA-256 doesn't have enough margin in the digest size to truncate it. (SHA-224 is a thing, but it dumps more collision resistance than we're really comfortable with, in exchange for being only slightly resistant to length extensions.)

3. If your goal is to construct an XOF, SHA-512/256 is kind of perversely inefficient. You end up throwing away half your output bytes to prevent length extensions, but then running the hash (or at least the compression function) again to get more output bytes. On the output performance side, this is leaving a free factor of two on the table.
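The "truncate SHA-512 yourself" workaround from point 1 might look roughly like this. The function name `sha512_truncated_256` is made up for the sketch. The optional comparison at the end assumes that `hashlib.new("sha512_256")` may or may not be available, depending on the OpenSSL build Python is linked against, so it is wrapped in a `try`.

```python
import hashlib

def sha512_truncated_256(data: bytes) -> bytes:
    """First 32 bytes of plain SHA-512.

    This is NOT output-compatible with the standardized SHA-512/256,
    which starts from different initial values.
    """
    return hashlib.sha512(data).digest()[:32]

digest = sha512_truncated_256(b"abc")
print(len(digest), digest.hex())

# Optional: compare against the real SHA-512/256 where the local build
# exposes it. The two digests are expected to differ.
try:
    real = hashlib.new("sha512_256", b"abc").digest()
    print("same as SHA-512/256:", real == digest)
except ValueError:
    print("sha512_256 is not available in this build")
```

Because the caller only ever sees half of the 64-byte state, a length extension attack has nothing complete to resume from.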

didericis 1620 days ago

This is very helpful, thank you. This whole thread is making me realize I should read up more on hash differences/implementations.

oconnor663 1620 days ago

(shameless plug) If you want to start by doing your own implementation of SHA-256, you can take a look at one of my assignments :) https://github.com/oconnor663/applied_crypto_2021_fall/tree/...

mistercow 1620 days ago

Not so theoretically. Perfect hash functions have exactly that property, although I've never heard of a perfect cryptographic hash function. That concept seems inherently contradictory.

3np 1620 days ago

Arguably, by definition a hash function takes arbitrary-size input and produces fixed-length output; change either of those and it's no longer a hash function. Keep both, and the pigeonhole principle guarantees infinitely many theoretical collisions.
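A toy illustration of the pigeonhole argument, assuming a deliberately tiny 8-bit "hash" made by keeping only the first byte of SHA-256. `tiny_hash` and `find_collision` are hypothetical names for this sketch. With 256 possible outputs, any 257 distinct inputs must contain a collision.

```python
import hashlib

def tiny_hash(data: bytes) -> int:
    """Toy 8-bit hash: the first byte of SHA-256 (illustration only)."""
    return hashlib.sha256(data).digest()[0]

def find_collision(limit: int = 257):
    """Try `limit` distinct inputs and return the first colliding pair."""
    seen = {}
    for i in range(limit):
        msg = str(i).encode()
        h = tiny_hash(msg)
        if h in seen:
            return seen[h], msg, h
        seen[h] = msg
    return None  # unreachable when limit > 256: only 256 outputs exist

a, b, h = find_collision()
print(a, b, h)
```

In practice the search stops long before 257 inputs because of the birthday effect. Real hashes behave the same way in principle, except the output space is so large that the search is infeasible.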

AlexSW 1620 days ago

The output length/size of a hash function is fixed, whereas it takes an arbitrary-length/size input.

didericis 1620 days ago

I know it's normally fixed, but I did a quick google and saw a stack overflow answer saying there are some algorithms that allow for variable length outputs: https://crypto.stackexchange.com/a/3564

That doesn’t necessarily mean you can figure out what length output for a given input is needed to make it one to one. Not sure you could avoid collisions even if the length of the output was infinite, but I’m assuming different inputs have to have outputs that diverge at some point.

SAI_Peregrinus 1620 days ago

The variable output length functions are called eXtendable Output Functions (XOFs). An XOF isn't a cryptographic hash function, though they can be very similar (and can have an identical internal function doing the work).

We use different words for Cryptographic Hashes, Password Hashes, XOFs, MACs, and (non-cryptographic) Hashes because they do different things and have different security properties. Misusing the terminology makes reasoning about what is meant difficult.

dchest 1620 days ago

They all have a fixed-length internal state and will have internal state collisions regardless of the output size. (Basically, they "consume" input into the state and then "expand" the resulting state into the output.)

Suggested reading: Sponge Functions - https://keccak.team/files/SpongeFunctions.pdf

"Informally speaking, a random oracle* maps a variable-length input message to an infinite output string. It is completely random, i.e., the produced bits are uniformly and independently distributed. The only constraint is that identical input messages produce identical outputs. A hash function produces only a fixed number of output bits, say, n bits. So, a hash function should behave as a random oracle whose output is truncated to n bits."

* https://en.wikipedia.org/wiki/Random_oracle

7steps2much 1620 days ago

You are correct, at least in theory.

Assume you have two inputs, A and B.

A may hash to: A38uT75kjGz

B may hash to: A38uHso629t

So yes, if you were to cut these off after the A38u then you would no longer be able to say for sure if you hashed A or B to arrive at your hash.

Of course in practice this usually isn't a problem as long as you have "a long enough" output.

bitkrieg 1620 days ago

Your example made me wonder, is there a known instance from a common hash algorithm where the input results in exactly the same string representation of the output hash? Eg. "AE485D" hashes to "AE485D". Is this even mathematically possible?

morelisp 1620 days ago

The mathematical term for this is a "fixed point", where f(x) == x.

Assuming outputs are perfectly random and uniformly distributed (the usual desirable property of a cryptographic hash), the probability of a hash function not having a fixed point (that is, hashing no x to itself) is (1-1/n)**n, where n is the number of possible outputs. As n approaches infinity, which it does pretty rapidly in this case since we're talking about 2**32 to 2**512 in practice, this approaches 1/e, or about 37%.

So, not only is it possible, but most "good" hash functions (63% of them) will have at least one.
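A quick numeric check of the 1/e figure, as a sketch. It assumes the idealized model from the comment (a uniformly random function from an n-element set to itself), and the function names, the small n, the trial count and the seed are arbitrary choices for illustration.

```python
import math
import random

def p_no_fixed_point(n: int) -> float:
    """Probability that a uniformly random function on n points has no fixed point."""
    # Note: for very large n (e.g. 2**256), 1 - 1/n rounds to 1.0 in floating
    # point, so this direct formula is only meaningful for moderate n.
    return (1 - 1 / n) ** n

def simulate(n: int, trials: int, seed: int = 0) -> float:
    """Estimate the same probability by sampling random functions."""
    rng = random.Random(seed)
    no_fixed_point = 0
    for _ in range(trials):
        # A random function sends each x to an independent uniform value.
        if all(rng.randrange(n) != x for x in range(n)):
            no_fixed_point += 1
    return no_fixed_point / trials

print(p_no_fixed_point(2**16), 1 / math.e)
print(simulate(64, 5000), p_no_fixed_point(64))
```

Both lines should print pairs of values close to 0.37, which leaves roughly 63% of random functions with at least one fixed point.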

kevinventullo 1620 days ago

Java’s built-in hash function for integers is the identity function.

charcircuit 1620 days ago

The modulus function has this property.

charcircuit 1620 days ago

Those are XOFs (extendable output functions), not hash functions.