First practical SHA-256 collision for 31 steps. fse2024 | HN Mirror


	First practical SHA-256 collision for 31 steps. fse2024 (twitter.com)
	179 points by devStorms 819 days ago

7 comments

lifthrasiir 819 days ago

It took me a lot of head scratching to exactly understand what this means, so for your information: this is not a full attack and you are safe (for now). If you need a concrete proof:

    import hashlib
    m0 = bytes.fromhex('''
        c32aef52 512294ba 9db5ed8c 8c8c88ed b2de2765 63a2d14e ec7619cc 93b21182
        e5050f50 f0839b60 7b1ee176 aaa06d68 c462343c 67898962 9558f495 04281f2c
    ''')
    m1 = bytes.fromhex('''
        5d0f5ae6 05e98311 8fa3c73a 9af8c49d a2bf31f7 de547b67 5baecee3 da0d8c94
        e4c19564 f682d45c f7c57698 f871f9b5 f14469b7 fc28eb0c 2d76db75 043fe071
    ''')
    m1p = bytes.fromhex('''
        5d0f5ae6 05e98311 8fa3c73a 9af8c49d a2bf31f7 de548b61 5b8e46f2 8a1dd69a
        bcc08464 f6825458 f7c57698 f871f9b5 f14469b7 fc28eb0c 2d76db75 043fe071
    ''')
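    # Hash both two-block messages with the full, 64-round SHA-256.
    print(hashlib.sha256(m0 + m1).hexdigest())
    # 2627577ac401cf44d837cf8471cac13ad7d8385bd00e4daf59fd3c3c646eaaae
    print(hashlib.sha256(m0 + m1p).hexdigest())
    # c945222bf0868a2218d5683c69b2b6c4720093e40c46d1197262d991e4d483b6
    # The two digests differ, so this pair is not a collision for the real hash.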

As far as I can understand, this is the same as [1]: the first practical semi-free-start collision for 31 out of 64 rounds of SHA-256, at a complexity of 2^49.8. "Step" here equates to "round", which is not always the case, and that confused me a lot. (RIPEMD-160, for example, has 5 rounds of 16 steps each.) There are other, theoretical cryptanalyses covering more rounds of SHA-256, but this one is fairly practical and the group has explicitly demonstrated it. It is still far from a full collision attack like the one MD5 suffered back in 2009.

(By the way I couldn't exactly reproduce the claimed result even with a 31-round version of SHA-256. Maybe they simply ran a step function 31 times without any initial rounds? I don't know.)

EDIT: @Retr0id has reproduced this result: https://bsky.app/profile/retr0.id/post/3konobbmf6o2a

[1] https://eprint.iacr.org/2024/349.pdf

wongarsu 819 days ago

There was a practical collision attack on 28 rounds in 2016. Only 3 rounds of progress in 8 years is a pretty good sign for sha256.

For new code it might be better to use blake2b, blake3 or sha3, but at the same time I don't think there is any rush to migrate existing systems away from sha256.

egberts1 819 days ago

Better off with SHAKE256: none of that "oops, I went with easier SHA3-224", plus SHAKE256 is faster.

lifthrasiir 818 days ago

Indeed. SHA-2 has turned out to be stronger than was expected a decade ago.

layer8 819 days ago

“Steps” means “rounds” here. For the general advances see the table under https://en.wikipedia.org/wiki/SHA-2#Cryptanalysis_and_valida... .

In 2016 there was a practical collision attack for 28 rounds. At that rate of progress, a practical collision attack for all 64 rounds would be reached in around 90 years from now.
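A back-of-the-envelope sketch of that extrapolation, assuming (as the comment does) that progress stays linear, which is a big assumption; the variable names are illustrative only:

```python
# Naive linear extrapolation of practical collision attacks on SHA-256.
rounds_2016 = 28    # practical collision attack, 2016
rounds_2024 = 31    # practical collision attack, 2024
total_rounds = 64   # full SHA-256

rounds_per_year = (rounds_2024 - rounds_2016) / (2024 - 2016)
years_to_full = (total_rounds - rounds_2024) / rounds_per_year
print(years_to_full)  # 88.0, i.e. "around 90 years"
```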

tptacek 819 days ago

This is a good time to re-read JP Aumasson's "Too Much Crypto" post:

https://eprint.iacr.org/2019/1492.pdf

The comparison is probably broken in a variety of ways, but the Keccak team proposed KangarooTwelve, a 12- (1/2 as many) round Keccak variant, after a practical attack on 6-round Keccak was published.

GoblinSlayer 818 days ago

I noticed blake3 uses 7 doublerounds, i.e. 14 chacha rounds. Is it intended due to increased communication or another bug?

jl6 819 days ago

I assume “steps” here means rounds? For reference, standard SHA-256 is 64 rounds.

H8crilA 819 days ago

SHA-2, including SHA-256, is constructed using a Davies–Meyer compression function. That compression function starts with a block cipher - so an object like AES, but with wider keys and wider block size. For SHA-2 this block cipher is called SHACAL-2.
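To make the construction concrete, here is a minimal sketch of the Davies–Meyer idea with a toy cipher. Everything in it is an assumption for illustration: `toy_encrypt` is a made-up 32-bit Feistel cipher, not SHACAL-2, and the sizes are far too small to be secure. The point is only the shape: the message block is used as the cipher key, the chaining value is the plaintext, and the old chaining value is added back in at the end.

```python
import hashlib

MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def round_fn(key: bytes, rnd: int, half: int) -> int:
    # Hypothetical round function: 16 bits derived from key, round, and half.
    data = key + bytes([rnd]) + half.to_bytes(2, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big")

def toy_encrypt(key: bytes, block: int, rounds: int = 4) -> int:
    # Toy Feistel network on a 32-bit block (stand-in for a real block cipher).
    left, right = block >> 16, block & MASK16
    for rnd in range(rounds):
        left, right = right, left ^ round_fn(key, rnd, right)
    return (left << 16) | right

def toy_decrypt(key: bytes, block: int, rounds: int = 4) -> int:
    # Inverse of toy_encrypt: with the key, the cipher alone is easy to undo.
    left, right = block >> 16, block & MASK16
    for rnd in reversed(range(rounds)):
        left, right = right ^ round_fn(key, rnd, left), left
    return (left << 16) | right

def davies_meyer(chaining: int, message_block: bytes) -> int:
    # Encrypt the chaining value under the message block, then feed forward.
    return (toy_encrypt(message_block, chaining) + chaining) & MASK32

state = davies_meyer(0x6A09E667, b"first block")
print(hex(state))
```

The feed-forward addition is what stops someone who knows the message block from simply decrypting the output to recover the previous chaining value.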

Now what we're seeing here is an attack on SHA-2 assuming a very, very significant degradation in SHACAL-2, where we run far fewer rounds than the standard specifies. This is your typical cryptanalytic result: interesting, but very, very far from showing that "SHA-2 is broken".

As a side note I once estimated that the Bitcoin network is likely to produce a collision in SHA-256 sometime in 2050s, assuming the current rate of growth of the hash throughput. Of course that's a big assumption, and also nobody will notice the collision, as nobody is saving all those past hashes :)

Another side note - if you're interested in learning about hash functions then I recommend looking into SHA-3. Not because it's newer and shinier, but because I think it is actually the easiest to understand. It has a very clever design.

panzi 819 days ago

I wonder, given the current rate of development, when there will be the first collision in the hashes of the Linux kernel git repository. Wait, did git finish the switch to SHA-256 or is it still using SHA-1? Googling... all I can find suggests that everyone is still using it with SHA-1, and SHA-256 repos aren't compatible with SHA-1 repos (whatever that means exactly).

CuriousCosmic 819 days ago

So tldr is "it's in progress".

You can use SHA-256 in production. And you can convert SHA-1 repos into SHA-256 repos.

However:

- SHA-1 repos are not compatible with SHA-256 repos so you can't mix and match the trees (i.e. a SHA-256 fork couldn't upstream their commits to a SHA-1 repo).

- The conversion path from SHA-1 to SHA-256 will break all GPG signatures on the repo.

- There may be breaking changes to the SHA-256 repository implementation in the future; however, those changes are guaranteed to come with an upgrade path for any users of the existing SHA-256 implementation.

So it's viable as an option but it's by no means "blessed" like the existing SHA-1 impl is.

H8crilA 819 days ago

I would only add that an organic (accidentally created) hash collision in Git will take an extreme amount of time. However, even today you can download the two PDFs from https://shattered.io/, put them both in your Git repository and watch Git crash. Given the construction of SHA-1 (Merkle-Damgard), it is easy to create an unlimited number of derivative files that also collide; they just have to have the correct prefixes (and then arbitrary but identical suffixes). Or upload only one of those files, but later pretend that it was the other. The authors were even kind enough to create a file tester on that very website :), but note that a determined adversary can recreate the attack and create a different set of prefixes.
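The "identical suffixes" point can be sketched with a toy Merkle-Damgard hash. This is an illustration under stated assumptions, not SHA-1: `toy_md` is a made-up hash with a deliberately tiny 16-bit chaining state so that a one-block collision can be brute-forced in well under a second.

```python
import hashlib
from itertools import count

BLOCK = 8   # bytes per message block
STATE = 2   # bytes of chaining state (tiny on purpose)
IV = b"\x00" * STATE

def compress(state: bytes, block: bytes) -> bytes:
    # Hypothetical compression function: truncated SHA-256 of state + block.
    return hashlib.sha256(state + block).digest()[:STATE]

def toy_md(msg: bytes) -> bytes:
    # Merkle-Damgard: pad with 0x80, zeros, and the message length, then chain.
    padded = msg + b"\x80"
    padded += b"\x00" * (-(len(padded) + 4) % BLOCK)
    padded += len(msg).to_bytes(4, "big")
    state = IV
    for i in range(0, len(padded), BLOCK):
        state = compress(state, padded[i:i + BLOCK])
    return state

def find_block_collision():
    # Birthday search for two different first blocks with the same state.
    seen = {}
    for n in count():
        block = n.to_bytes(BLOCK, "big")
        state = compress(IV, block)
        if state in seen:
            return seen[state], block
        seen[state] = block

a, b = find_block_collision()
suffix = b"any identical suffix, e.g. the rest of a PDF"
print(a != b, toy_md(a + suffix) == toy_md(b + suffix))  # True True
```

Once the two prefixes reach the same internal state, every block that follows is processed identically, so any shared suffix keeps the collision alive.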

SHA-1 really is broken, and therefore standard Git repositories do not offer integrity protection against someone who is determined to do harm and has some resources.

formerly_proven 819 days ago

git has been using the hardened variant of SHA-1 for ages, so the shattered.io files haven't had that effect for a long time.

Edit: Since git 2.13, released about a month after SHAttered was published in 2017: https://github.com/git/git/blob/master/Documentation/RelNote...

free_bip 819 days ago

Additionally, AFAIK, none of the major repo hosting services (GitHub, gitlab, Bitbucket) support sha-256 repos.

CuriousCosmic 819 days ago

This is true; however, that is changing very soon now that SHA-256 is no longer marked experimental.

GitLab has been working on integrating SHA-256 support for a while. According to this comment[1], there's only one major blocker left (which seems to be completed at the time of this comment) before they can start testing SHA256 support on GitLab.org.

1. https://gitlab.com/groups/gitlab-org/-/epics/10981#note_1797...

panzi 819 days ago

Thanks!

slau 819 days ago

Yes, the follow-up post (hidden by default) reads:

> Don’t panic, folks. This is very good work, especially given the low memory complexity of this attack. But there are 33 steps left. Your bitcoins are safe.

popol12 819 days ago

Bitcoin is using double sha256, just in case someone is wondering.

Though I wonder if double sha256 makes it twice as hard to break, or more or less than that.

nabla9 819 days ago

Frank @jedisct1

>Wouldn’t help in that case. Collision resistance of a composition degrades to the one of the weakest function (it’s even slightly worse). Double SHA2 only protects against length extension attacks. https://twitter.com/jedisct1/status/1772911384356868586

Karliss 819 days ago

I don't think double sha256 makes any difference with regards to collisions. If there is a collision after single sha256 they will still collide after second layer of hashing sha256(x)=sha256(y) => sha256(sha256(x))=sha256(sha256(y)).
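A small sketch of that implication, using a toy hash so that a collision can actually be found. `h16` is an assumption for illustration (SHA-256 truncated to 16 bits), not anything Bitcoin uses:

```python
import hashlib
from itertools import count

def h16(data: bytes) -> bytes:
    # Toy hash: first 2 bytes of SHA-256, weak enough to collide quickly.
    return hashlib.sha256(data).digest()[:2]

def find_collision():
    # Birthday search over the strings "0", "1", "2", ...
    seen = {}
    for n in count():
        msg = str(n).encode()
        digest = h16(msg)
        if digest in seen:
            return seen[digest], msg
        seen[digest] = msg

x, y = find_collision()
print(x != y and h16(x) == h16(y))   # True: a collision in the inner hash
print(h16(h16(x)) == h16(h16(y)))    # True: it survives the outer hash
```

The outer hash receives identical inputs, so it cannot separate the pair again.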

rowbin 819 days ago

But you are going backwards though. You have a sha-256 value and want to find an input with the same result. But this input again has to be a sha-256 result, and you need to find an input for that as well, right? This would only work if you have the intermediate sha-256 value that produces the final sha-256, or you can find a collision that itself is a sha-256 value.

glandium 819 days ago

Going backwards, as you say, is called a pre-image attack. That's different from a collision attack, which is generating two inputs with the same hash.

Pre-image attacks are MUCH more difficult. How much more? Well, MD5 is considered broken, and yet there isn't one for it.

tialaramex 819 days ago

There is a pre-image attack for MD5, it's just not considered good enough to be practical. Quoting Wikipedia:

> In April 2009, an attack against MD5 was published that breaks MD5's preimage resistance. This attack is only theoretical, with a computational complexity of 2123.4 for full preimage.

H8crilA 819 days ago

Yes, but that's very little improvement over the generic 2^128 attack - trying random messages until one happens to match the target hash. The attack quoted by Wikipedia achieves only 4.6 bits of speedup (note that it's 2^123.4, not 2123.4 :) ). There are attacks of this sort against many cryptographic primitives, including AES, where you can gain just a few bits over the generic / brute force attacks.
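The "4.6 bits" figure is just a subtraction of exponents; a quick sketch of the arithmetic (variable names are illustrative):

```python
generic_bits = 128.0   # brute-force preimage search on a 128-bit hash
attack_bits = 123.4    # complexity quoted for the 2009 MD5 preimage attack

speedup_bits = generic_bits - attack_bits
speedup_factor = 2 ** speedup_bits
print(round(speedup_bits, 1))   # 4.6
print(speedup_factor)           # roughly 24 times faster than brute force
```

A factor of about 24 out of 2^128 leaves the attack just as impractical as the generic one.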

popol12 819 days ago

Let's say I have a string S.

MD5(MD5(S)) = Y

Now, I find a collision string SS (of length 128 bits, like an MD5 hash), where MD5(SS) == Y

Then I find a collision string SSS (this time, length doesn't matter), where MD5(SSS) == SS

Then we have MD5(MD5(SSS)) == Y, which was only twice as hard as finding a single MD5 collision.

Could someone explain what is wrong with my reasoning ?

Edit: Oh okay, got it. When we say "MD5 is broken, it's possible to do a collision attack", what we mean is that we can easily find 2 strings S1 and S2 where MD5(S1) == MD5(S2). But S1 and S2 are found randomly; we don't have a way to find a string S3 where MD5(S3) == Y for any Y value (that is what we call a pre-image attack, not a collision attack).

H8crilA 819 days ago

Pre-image is approximately "twice as difficult" as a collision. A generic attack on, say, a 256 bit long hash function takes 2^128 time to find a collision, but 2^256 time to find a preimage. And like you say, this also shows up in practice: both MD-5 and SHA-1 are completely broken when it comes to collision resistance, but both are (probably) still OK for preimage resistance. I would still not recommend either of them for anything.

tialaramex 819 days ago

Where on earth did you get this idea from? What is a "generic attack"? How could you turn a collision somehow into a pre-image attack? How is many orders of magnitude "twice" ?

popol12 819 days ago

twice as difficult ? It doesn't match what you say after that

2²⁵⁶ = 2¹²⁸ * 2¹²⁸

So, isn't it rather 2¹²⁸ times more difficult ?
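The disagreement is about whether "twice" refers to the exponent or to the work itself. A quick check with exact integers, assuming the usual generic bounds for an ideal n-bit hash (about 2^(n/2) work for a collision by the birthday bound, about 2^n for a preimage):

```python
n = 256                          # digest size in bits

collision_work = 2 ** (n // 2)   # generic birthday-bound collision search
preimage_work = 2 ** n           # generic preimage search

# The exponent doubles (128 -> 256) ...
print(preimage_work == collision_work * collision_work)   # True
# ... which means the work itself grows by a factor of 2^128, not 2.
print(preimage_work // collision_work == 2 ** 128)        # True
```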

renonce 818 days ago

I understand the definitions of such crypto algorithms but have no idea about differential cryptanalysis. Can someone explain how attacks like this are constructed, and why it took 8 years to advance cryptanalysis by 3 rounds? What insight was needed that took 8 years to discover and formulate as a practical attack?

0x073 819 days ago

Good that git still uses sha1 ;)

hn_p4ttern 819 days ago

It is used to sign a commit, right? What are the probabilities of having a collision that:

a) is still code

b) is still code AND is code similar to a previous commit

c) is still code AND is code similar to a previous commit AND is valid

d) is still code AND is code similar to a previous commit AND is valid AND makes sense for something

OR at least

a) is still code

b) is still code AND is valid

d) is still code AND is valid AND makes sense for something

Let me know.

pornel 819 days ago

For now the SHA-1 collisions are easily detectable, but it could get worse.

In case of MD5, there is now a collision I wouldn't expect was possible: in readable ASCII.

https://mastodon.social/@Ange/112124123552605003

hn_p4ttern 819 days ago

> "For now the SHA-1 collisions are easily detectable, but it could get worse."

Your opinion: prove it! And again, if instead of trolling you actually read the post in THIS BRANCH, the question is: should SHA-1 in Git be substituted?

jacobgorm 819 days ago

It is used to name a commit, not to sign it. So the data structure itself will be corrupted if there is a collision, as it relies on the invariant that each commit has a unique name. And the collision has to happen within a single repo.

cyph3r0 819 days ago

> It is used to name a commit, not to sign it.

This is bullshit. Really. If you only have to "name a commit", you can use a sequence from 0 to N. Why would someone waste computation power to calculate a hash that is also a really user-unfriendly naming system? Think about it.

The correct answer is: for signing the commit AND for database indexing: "Git uses hashes in two important ways.

When you commit a file into your repository, Git calculates and remembers the hash of the contents of the file. When you later retrieve the file, Git can verify that the hash of the data being retrieved exactly matches the hash that was computed when it was stored. In this fashion, the hash serves as an integrity checksum, ensuring that the data has not been corrupted or altered.

For example, if somebody were to hack the DVCS repository such that the contents of file2.txt were changed to “Fred”, retrieval of that file would cause an error because the software would detect that the SHA-1 digest for “Fred” is not 63ae94dae606…

Git also uses hash digests as database keys for looking up files and data.

If you ask Git for the contents of file2.txt, it will first look up its previously computed digest for the contents of that file[45], which is 63ae94dae606… Then it looks in the repository for the data associated with that value and returns “Erik” as the result. (For the moment, you should try to ignore the fact that we just used a 40 character hex string as the database key for four characters of data.)"

Source: https://ericsink.com/vcbe/html/cryptographic_hashes.html#:~:.... ~
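As a small sketch of how a content hash doubles as a name and a lookup key: for file contents (a "blob") in a SHA-1 repository, Git hashes a short header followed by the bytes, which is why the ID differs from a plain SHA-1 of the file. The helper name `git_blob_id` is made up for this example; the result should match what `git hash-object` prints for the same content.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    # Git hashes "blob <size>\0" followed by the content, not the content alone.
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)
print(git_blob_id(b"Erik") == hashlib.sha1(b"Erik").hexdigest())  # False
```

Because the name is derived from the content, any two objects with the same hash are treated as the same object, which is exactly why a collision matters here.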

jacobgorm 814 days ago

Earlier systems like perforce used the totally ordered integer naming scheme you describe, but it requires a centralized entity to keep the names globally unique. Using hashes for naming avoids this, and the way they are used in git imposes a partial order.

Rygian 819 days ago

For the choices after your "OR at least" line, just consider that most of the collision material could be padded into a comment, so achieving a), b) and d) would be "trivial."

hn_p4ttern 819 days ago

IMHO "be padded into a comment" is included in "is valid code", still 1 in <number_of_particles_in_universe_here^1E100> is a good approximation of that probability.

Please, correct me if I'm wrong.

maxcoder4 819 days ago

Do you mean with the current public knowledge or hypothetically? For md5 all of these are doable right now (except maybe code that "makes sense" to a human reader). Also, in practice it's much easier to do this with a data file, as demonstrated for SHA1 with a "backdoored" certificate.

hn_p4ttern 819 days ago

1) We are talking about sha1; md5 is off topic

2) This is the main topic! Being able to generate >>valid code<< with a >>specific purpose<<, so that Git has to change its hashing algorithm;

3) A.K.A. your answer is total nonsense.

Everyone else, OK, I'm listening: give proof that you can change code on GitHub by stealthily messing with hashing, moreover inserting a "payload" that creates a SHA-1 collision in a reasonable computational time. Everything else is BS.

TrueDuality 819 days ago

Even without comments your additional requirements aren't relevant, but not in the way I think you're assuming.

When you're searching for a practical collision you only need a way to generate systematic output that semantically will be interpreted with your intent. The easiest way to do this is to include semantically irrelevant data to something that was manually produced that is semantically relevant.

In the programming domain, source code specifically, comments are the easiest way to include semantically irrelevant information but you could also include unused functions, variable names etc. You are literally limited by the constraints of your imagination and your ability to dodge CI failure checks.

Aha! You might say, but any human that saw that change or PR would immediately notice the garbage produced and catch the collision attempt! (This is your argument.) Unfortunately no, that assumes the search space I talked about is over semantic garbage. It's a bit more work, but your search space for a collision could be "Shakespearean sonnets that would make a literary buff cry", as long as you had a generator that could produce them and produced different outputs from different seeds.

We now have access to a generator that can take an incrementing seed number and produce content that is either semantically meaningful or plausible-looking but semantically irrelevant: language models. Interestingly, this moves the compute cost to the generator (usually the compute restriction is on the hash being attacked).

It's definitely not practical with our current compute capabilities to attack a search space of 2^256 by brute force for a simple hash, much less while waiting for a language model to produce an output from a different input seed for each check. But that's not what this article is about either...

What these collision attacks (such as the linked article) do is _decrease the search space_. Without any algorithmic tricks the search space for sha2-256 is 2^256. These tricks are eating away at that exponent. This work reduces the cost of a collision to 2^49.8. That is a massive drop in the search space. Is it still feasible to attack today? Absolutely not. But a few more of these tricks and I can see those "garbage comments" collisions happening. And if you wait a tiny little fraction of time beyond that and include language models for your search space?

Hell your changes could be _productive_ and produced incrementally through a series of commits if you really wanted to limit your search space and get creative about it.

With SHA-1, collision attacks using semantic garbage are already considered practical. We're still probably computationally constrained in using language models to produce semantically viable collisions, but we're not that far off either. Those comments won't be garbage. You will not be able to distinguish them from any other AI-generated code being committed, which is rapidly improving in quality and in efficiency to generate.

Even without language models you could use something like a language's EBNF grammar as a token generator for source code which would probably pass any glance checks, but definitely not dedicated inspection like a code review. That is probably something that IS PRACTICAL TODAY for SHA1.

hn_p4ttern 819 days ago

My point is: why should you change the hashing algorithm in Git??? Let's elaborate:

1. Does SHA-1 pose a security risk in Git?

2. Is that practically exploitable in any way?

In some applications, for example password hashing, SSH MAC, etc., you have good reasons to change the hashing algorithm when it becomes obsolete: because an attacker can gain a computational advantage in cracking a password, compromising the integrity of transmitted packets, etc.

But just because a hashing algorithm has become obsolete for some applications does not mean it is obsolete for ALL possible applications. Moreover, in some specific applications a faster hashing algorithm could be DESIRABLE.

So why should you change SHA-1 in Git?

>> "But a few more of these tricks and I can see those "garbage comments" collision happening"

I don't think so; it is computationally astronomically difficult whatever tricks you invent. The point here IS NOT to generate a collision by adding "garbage comments"; again, it is to alter the behaviour of committed code in a functional way.

>> "Even without language models you could use something like a language's EBNF grammar as a token generator for source code which would probably pass any glance checks, but definitely not dedicated inspection like a code review. That is probably something that IS PRACTICAL TODAY for SHA1"

Yeah, prove it!

Rygian 819 days ago

That's what I meant.

keybored 819 days ago

I believe there is one more step. You have to somehow get the collision into the repository. Because if you have <hash> in your own repo and pull something from another repo with the same <hash>, the remote changes will not overwrite your blob for <hash> (it will stay the same). Or at least that’s what I seem to remember from something that Torvalds wrote.

hn_p4ttern 819 days ago

> "I believe there is one more step. You have to somehow get the collision into the repository."

Yes, exactly. So, is it necessary to change SHA-1 in Git? At the moment, I think there is no reason to, because SHA-1 doesn't expose security vulnerabilities or functional issues.

IncreasePosts 819 days ago

Brave of you to assume I'm committing valid and sensical code to git

dist-epoch 819 days ago

Due to the way hashing works, any change is equivalent to any other one for the purpose of finding a collision.

So you can just alter the formatting to a different convention, alter spacing, add a comment, reorder equivalent lines.

So you can insert a comment and continue altering it until you get a match by varying the line breaking, switching words with synonyms.
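A toy sketch of that "keep altering a comment until it matches" loop. It is only feasible here because the target is truncated to 16 bits, which is an assumption made for the demo; doing this naively against a full 160-bit SHA-1 is far out of reach, and real attacks rely on cryptanalytic shortcuts instead. The file contents and helper names are invented for illustration.

```python
import hashlib
from itertools import count

PREFIX_BYTES = 2  # match only the first 16 bits, so the search finishes quickly

def short_id(data: bytes) -> bytes:
    # Truncated SHA-1, standing in for a full hash that is too strong to search.
    return hashlib.sha1(data).digest()[:PREFIX_BYTES]

original = b"def pay(amount):\n    return amount\n"
tampered = b"def pay(amount):\n    return amount * 2\n"
target = short_id(original)

def pad_until_match(code: bytes, wanted: bytes) -> bytes:
    # Append a harmless comment and vary it until the truncated hash matches.
    for n in count():
        candidate = code + b"# note " + str(n).encode() + b"\n"
        if short_id(candidate) == wanted:
            return candidate

forged = pad_until_match(tampered, target)
print(forged.decode())
print(short_id(forged) == short_id(original))  # True
```

The forged file does not need to be the same size as the original; only the hash has to match, and the expected number of tries doubles with every extra bit that must match.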

chasil 819 days ago

Doesn't it also have to be the same size in bytes?

hot_gril 819 days ago

Good old MDA4. Nothing beats that.