Hacker News new | ask | show | jobs
by RcouF1uZ4gsC 1521 days ago
Exactly. Also SHA-256 is in the same family as MD-5. It would be nice to kind of go over how the family in general works, what weaknesses were discovered, and what changes were incorporated into SHA-256 that addresses these issues.
1 comments

I’ve always been curious about whether it’s possible to have a collision when your input is 256 bits or less (in the case of sha256)… I even emailed Bruce Schneier. I got a polite response that he didn’t have time to look at it, which indicated to me he didn’t know “off the top of his head”, which I interpreted as “it’s not impossible”… but I still don’t know.
There are 2^257 - 1 inputs of 256 bits or less and only 2^256 possible hashes, so there must be a collision.
The input is always padded out to 256 bits, though, isn't it?
The input is padded to form a sequence of 512 bit chunks, but that doesn't change anything, padding is one-to-one so there's still the same number of inputs.
> padding is one-to-one so there's still the same number of inputs

That part is debatable, if we imagine different padding schemes.

If the padding was just 0s, I would easily accept an argument that 111000 and 1110 are the same input giving the same hash.

You could also say you consider the extra '1' bit in SHA-256 as part of the payload, not truly 'padding' because it's mandatory, and make a similar argument, and it wouldn't be blatantly wrong.

> That part is debatable, if we imagine different padding schemes.

It isn't debatable because SHA-256 (and all these schemes) define the padding because it's necessary to have padding defeat the problem you're about to talk about and to do that they are in fact one-to-one:

> If the padding was just 0s, I would easily accept an argument that 111000 and 1110 are the same input giving the same hash.

As you might have guessed, this is not how it works.

SHA-256 appends that "extra 1 bit" you talk about, then zeroes until it is 64-bits short of a multiple of 512 bits, and then a 64-bit count.

So, as the earlier poster explained there are 2^257 -1 distinct hash inputs in abotsis' imagined set and thus there must be collisions by the pigeon hole principle, it really is that simple.

One way to think of hash functions is that the hash of any value is basically a random number. So we can consider what happens if the results were actually random, then what would be the probability of a collision?

Consider a function from [0, 2^256 - 1] to [0, 2^256 - 1]. That is it maps 256bit numbers to 256bit numbers. It could in theory be represented as an enormous table lookup.

Now how many such functions are there? Well there are 2^256 ways to map 0. And then 2^256 ways to map 1.. etc. We have 2^256 inputs we need to deal with each of which can give one of of 2^256 results. That turns into (2^256) ^ (2^256)

Now, how many collision free functions are there? In this case there are 2^256 ways to map 0, but then we have to pick a different number for 1, so there are only 2^256 - 1 possiblites. Then 2^256 - 2 etc... It becomes (2^256)! where the exclamation mark is factorial.

So the probability of no collisions is (2^256)! / [(2^256) ^ (2^256)]. It may not be obvious but that's a very small number. A little bit of intuition: let's say we fill out the values of our gigantic table by hand. Let's assume that by some miracle we filled out the first half without creating any collisions. Now that means we used up half the possible outputs. So every cell we fill out from now we have a 50/50 of creating a collision. And the further we get the more numbers are used up.

Now sha256 is not a randomly chosen function. But without any evidence either way, we might suspect it does have collisions.

To give some further context, I stumbled across this thought while reading how “community ids” are calculated. Community ids are commonly used to simplify joining/lookups for network security tools (suracata, zeek). They essentially concatenate the “quad tuple” (src ip/port, dest ip/port), and a “seed”, then run sha against it. I didn’t entirely understand the reason the authors chose sha (other than being security people who might have just reached for a crypto secure hash function). SHA is slow vs something like xxh, and given the number of sessions these things process, seemed like overkill. Further, it’s unclear to me what’s gained by using sha vs xxh or simply concatenating the bits. Then I started wondering about the downsides: whether it’s possible to have a false correlation because two sessions yielded the same sha digest.
What sort of collision? Or asking in a different way, what would you consider a collision?
2 messages producing the same digest. That’s actually a really interesting question.
I think he means a set-theoretical collision, not a feasible collision attack. I.e. "is SHA256 a bijection on 256-bit integers?"