Hacker News
by hn_p4ttern 819 days ago
It is used to sign a commit, right? What is the probability of a collision that:

a) is still code

b) is still code AND is code similar to a previous commit

c) is still code AND is code similar to a previous commit AND is valid

d) is still code AND is code similar to a previous commit AND is valid AND makes sense for something

OR at least

a) is still code

b) is still code AND is valid

d) is still code AND is valid AND makes sense for something

Let me know.

7 comments

For now the SHA-1 collisions are easily detectable, but it could get worse.

In the case of MD5, there is now a collision I would not have expected to be possible: one in readable ASCII.

https://mastodon.social/@Ange/112124123552605003

> "For now the SHA-1 collisions are easily detectable, but it could get worse."

Your opinion: prove it! And again, if instead of trolling you actually read the post in THIS BRANCH, the question is: should SHA-1 in Git be replaced?

It is used to name a commit, not to sign it. So the data structure itself will be corrupted if there is a collision, as it relies on the invariant that each commit has a unique name. And the collision has to happen within a single repo.

> It is used to name a commit, not to sign it.

This is bullshit. Really. If you only have to "name a commit" you can use a sequence from 0 to N. Why would anyone waste computing power calculating a hash that is also a really user-unfriendly naming system? Think about it.

The correct answer is that it is used for signing the commit AND for database indexing: "Git uses hashes in two important ways.

When you commit a file into your repository, Git calculates and remembers the hash of the contents of the file. When you later retrieve the file, Git can verify that the hash of the data being retrieved exactly matches the hash that was computed when it was stored. In this fashion, the hash serves as an integrity checksum, ensuring that the data has not been corrupted or altered.

For example, if somebody were to hack the DVCS repository such that the contents of file2.txt were changed to “Fred”, retrieval of that file would cause an error because the software would detect that the SHA-1 digest for “Fred” is not 63ae94dae606…

Git also uses hash digests as database keys for looking up files and data.

If you ask Git for the contents of file2.txt, it will first look up its previously computed digest for the contents of that file[45], which is 63ae94dae606… Then it looks in the repository for the data associated with that value and returns “Erik” as the result. (For the moment, you should try to ignore the fact that we just used a 40 character hex string as the database key for four characters of data.)"

Source: https://ericsink.com/vcbe/html/cryptographic_hashes.html#:~:.... ~
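
To make the two uses in that quote concrete, here is a minimal Python sketch of a content-derived id used as an integrity check. It assumes Git's blob convention of hashing a "blob <size>\0" header followed by the content; `git_blob_id` and `verify` are illustrative names, not Git APIs, and the ids it produces need not match the digest quoted above.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """SHA-1 over a 'blob <size>\\0' header plus the content (Git's blob convention)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def verify(content: bytes, expected_id: str) -> bool:
    """Integrity check: recompute the id and compare it with the stored one."""
    return git_blob_id(content) == expected_id

stored_id = git_blob_id(b"Erik")   # the id doubles as the lookup key
print(verify(b"Erik", stored_id))  # True: the content matches its key
print(verify(b"Fred", stored_id))  # False: altered content is detected
```

The same value serves as both the database key and the checksum, which is why a collision affects naming and integrity at once.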

Earlier systems like Perforce used the totally ordered integer naming scheme you describe, but it requires a centralized entity to keep the names globally unique. Using hashes for naming avoids this, and the way they are used in Git imposes a partial order.

For the choices after your "OR at least" line, just consider that most of the collision material could be padded into a comment, so achieving a), b) and d) would be "trivial."

IMHO "be padded into a comment" is included in "is valid code"; still, 1 in <number_of_particles_in_universe_here^1E100> is a good approximation of that probability.

Please, correct me if I'm wrong.

Do you mean with current public knowledge or hypothetically? For MD5 all of these are doable right now (except maybe code that "makes sense" to a human reader). Also, in practice it's much easier to do this with a data file, as demonstrated for SHA-1 with a "backdoored" certificate.

1) We are talking about SHA-1; MD5 is off topic.

2) This is the main topic! Being able to generate >>valid code<< with a >>specific purpose<<, such that Git would have to change its hashing algorithm;

3) A.K.A your answer is total nonsense.

Everyone else: OK, I'm listening. Give proof that you can stealthily change code on GitHub by messing with hashing, and moreover insert a "payload" by creating a SHA-1 collision in a reasonable amount of computation time. Everything else is BS.

1) yes, I gave you an example of a hash algorithm that is broken right now. SHA1 is only getting there, because the attacks are always only getting stronger. Responsible people don't wait until the attacks are practical and devastating, but instead react by predicting the obvious things that will happen in the future.

Overall I don't think you're arguing in good faith, so I'm going to walk away from this discussion.

Even without comments your additional requirements aren't relevant, but not in the way I think you're assuming.

When you're searching for a practical collision you only need a way to generate systematic output that semantically will be interpreted with your intent. The easiest way to do this is to add semantically irrelevant data to something that was manually produced and is semantically relevant.

In the programming domain, source code specifically, comments are the easiest way to include semantically irrelevant information but you could also include unused functions, variable names etc. You are literally limited by the constraints of your imagination and your ability to dodge CI failure checks.
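
A small Python sketch of what "semantically irrelevant data" means in practice. Everything here is hypothetical (the `ORIGINAL` snippet, the `variant` generator); it only shows that a seeded generator can produce many byte-level variants with different SHA-1 digests that all parse to the same program.

```python
import ast
import hashlib

# Hypothetical hand-written file that carries the intended behaviour.
ORIGINAL = "def area(w, h):\n    return w * h\n"

def variant(seed: int) -> str:
    """Append a seed-dependent comment; the parser discards comments."""
    return ORIGINAL + f"# note {seed:08x}\n"

def same_program(a: str, b: str) -> bool:
    """Compare syntax trees, which contain no comments or line positions."""
    return ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))

digests = {hashlib.sha1(variant(s).encode()).hexdigest() for s in range(1000)}
print(len(digests))  # number of distinct digests among the variants
print(all(same_program(ORIGINAL, variant(s)) for s in range(1000)))
```

Each seed gives the hash a fresh input while the compiler or interpreter sees the same thing, which is the degree of freedom a collision search needs.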

Aha! You might say, but any human that saw that change or PR would immediately notice the garbage produced and catch the collision attempt! (This is your argument.) Unfortunately no: that assumes the search space I talked about consists of semantic garbage. It's a bit more work, but your search space for a collision could be "Shakespearean sonnets that would make a literary buff cry" as long as you had a generator that could produce them and produced different outputs from different seeds.

We now have access to a generator that can take an incrementing seed number and produce both semantically meaningful content and meaningful-looking but semantically irrelevant content: language models. Interestingly, this moves the compute cost to the generator (usually the compute restriction is on the hash being attacked).

It's definitely not practical with our current compute capabilities to attack a search space of 2^256 through brute force for a simple hash, much less while waiting for a language model to produce an output from a different input seed for each check, but that's not what this article is about either...

What these collision attacks (such as the linked article) do is _decrease the search space_. Without any algorithmic tricks, the search space for SHA-256 is 2^256 for a preimage, and about 2^128 for a collision because of the birthday bound. These tricks are eating away at that exponent. This work results in a reduction of a collision to 2^49.8. That is a massive drop in the search space. Is it still feasible to attack today? Absolutely not. But a few more of these tricks and I can see those "garbage comments" collision happening, but wait a tiny little fraction of time beyond that and include language models in your search space?
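
For scale, a rough back-of-the-envelope in Python. It uses only the generic bounds: about 2^n evaluations to hit one specific n-bit digest, and about 2^(n/2) to find any colliding pair (the birthday bound). The 49.8 figure is taken from the comment above and is not re-derived here; which variant of the hash it applies to is not stated in this thread.

```python
def generic_preimage_bits(digest_bits: int) -> float:
    """Exponent of the brute-force cost to hit one specific digest: 2^n."""
    return float(digest_bits)

def generic_collision_bits(digest_bits: int) -> float:
    """Exponent of the birthday-bound cost to find any colliding pair: about 2^(n/2)."""
    return digest_bits / 2

QUOTED_ATTACK_BITS = 49.8  # figure quoted in the comment above, not re-derived here

for name, bits in [("SHA-1", 160), ("SHA-256", 256)]:
    print(f"{name}: preimage 2^{generic_preimage_bits(bits):g}, "
          f"collision 2^{generic_collision_bits(bits):g}")

gap = generic_collision_bits(256) - QUOTED_ATTACK_BITS
print(f"exponent gap to the quoted figure: {gap:.1f}")
```

Every unit shaved off the exponent halves the work, so the gap is what matters, not the absolute size of the digest.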

Hell your changes could be _productive_ and produced incrementally through a series of commits if you really wanted to limit your search space and get creative about it.

With SHA-1, collision attacks using semantic garbage are already considered practical. We're still probably computationally constrained in using language models to produce semantically viable collisions, but we're not that far off either. Those comments won't be garbage. You will not be able to distinguish them from any other AI-generated code being committed, which is rapidly improving in quality and efficiency of generation.

Even without language models you could use something like a language's EBNF grammar as a token generator for source code which would probably pass any glance checks, but definitely not dedicated inspection like a code review. That is probably something that IS PRACTICAL TODAY for SHA1.

My point is: why should you change the hashing algorithm in Git? Let's elaborate:

1. Does SHA-1 pose a security risk in Git?

2. Is that practically exploitable in any way?

In some applications, for example password hashing, SSH MACs, etc., you have good reasons to change the hashing algorithm when it becomes obsolete: an attacker can gain a computational advantage in cracking a password, compromising the integrity of transmitted packets, etc.

But a hashing algorithm that has become obsolete for some applications is not obsolete for ALL possible applications. Moreover, in some specific applications a faster hashing algorithm could be DESIRABLE.

So why should you replace SHA-1 in Git?

>> "But a few more of these tricks and I can see those "garbage comments" collision happening"

I don't think so: it is computationally astronomically difficult whatever tricks you invent. The point here IS NOT to generate a collision by adding "garbage comments"; again, it is to alter the behaviour of committed code in a functional way.

>> "Even without language models you could use something like a language's EBNF grammar as a token generator for source code which would probably pass any glance checks, but definitely not dedicated inspection like a code review. That is probably something that IS PRACTICAL TODAY for SHA1"

Yeah, prove it!

I agree, the necessity of something stronger than SHA-1 should be demonstrated.

I don't need to prove that I can do a thing to prove that a thing is possible, and the burden of proof is on you for claiming that this isn't an active security problem, because that's basically well known and well understood. The only outstanding questions are how detectable, impactful, and available those attacks are.

Specifically, you need to counter at least one of the things in the following list:

* Hash security: SHA-1 collisions are feasible to generate, and companies are actively moving away from SHA-1 with good reason and have been doing so for at least seven years (https://security.googleblog.com/2017/02/announcing-first-sha..., https://www.howtogeek.com/238705/what-is-sha-1-and-why-will-...)

* Content generation: As I've already discussed, the contents of what you use to make that collision can be anything you want and meet any requirements you have the ability to produce a generator for. To counter this you're going to have to prove to me that no engineer can make a seeded random generator that uses a language's grammar to produce plausible tokens that compile, or just use a language model to produce plausible code and comments (also requiring a seed). This is a _trivial_ thing to do.

* The attack: Git relies on a chain of hashes based on SHA-1, and those hashes are over the complete files included in the repository. If you can generate a collision for a file in Git's history, you can replace the file in that commit and all subsequent commits will remain valid. This is the attack everyone is worried about in relation to Git. The only thing that protects against this right now is the security of SHA-1. Additionally, signatures on commits and tags DO NOT protect against this: they're over the hash, commit message, and list of objects, not the objects themselves. The attacked files will still look like they came from a valid signed commit.

The extra scary part of that attack is the malicious/changed file will not be visible to any existing checkouts, those clients will believe they have the correct object and will continue to show that correct object. But anything that does regular fresh checkouts, like say a CI system that deploys to prod, will get the poisoned object. Even if it's checking the signatures on every commit, it won't see this coming.
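
A toy Python model of that chain, to show why a colliding file leaves everything above it unchanged. Big assumptions: the id is SHA-1 truncated to 16 bits so a match can be found in well under a second (this is a brute-force search on a crippled hash, not a real SHA-1 attack), and the tree and commit layouts are simplified text stand-ins for Git's actual formats. All names (`toy_id`, `tree_id`, `commit_id`, `alloc.c`) are made up for illustration.

```python
import hashlib
from itertools import count

def toy_id(kind: str, body: bytes) -> str:
    """Git-style header plus body, truncated to 16 bits so matches are cheap."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()[:4]

def tree_id(entries: dict) -> str:
    """A tree names its files by blob id only, never by content."""
    body = "".join(f"{name} {oid}\n" for name, oid in sorted(entries.items()))
    return toy_id("tree", body.encode())

def commit_id(tree: str, parent: str, message: str) -> str:
    """A commit names its tree and parent by id; a signature would cover this text."""
    return toy_id("commit", f"tree {tree}\nparent {parent}\n\n{message}\n".encode())

good = b"ptr = calloc(SIZE, sizeof(long));\n"
target = toy_id("blob", good)

# Vary a trailing comment until the altered file lands on the same toy id.
for n in count():
    evil = b"ptr = calloc(SIZE-10, sizeof(long)); /* %d */\n" % n
    if toy_id("blob", evil) == target:
        break

before = commit_id(tree_id({"alloc.c": toy_id("blob", good)}), "none", "init")
after = commit_id(tree_id({"alloc.c": toy_id("blob", evil)}), "none", "init")
print(evil != good, before == after)  # True True
```

Because trees and commits refer to files only by id, nothing above the swapped blob changes, and anything computed over the commit text stays valid.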

So the security of all our Git repos, our production environments, and our new devs is foundationally rooted in either the security of write access to the repository OR the foundational security of SHA-1.

I would say that is a practical and useful attack. A faster hashing algorithm will EXACERBATE this problem as you're almost always trading collision resistance for speed. Any hashing algorithm that allows you to calculate its hashes faster is MORE vulnerable to collision attacks not less.

"Computationally astronomical" isn't a very good argument. 20 years ago SHA1 was insane in its security. These things get weaker over time and need to be periodically replaced, not because they're failing, but because increased resource capacity has fundamentally changed the original assumptions the algorithm was designed for.

Even with the computationally astronomical argument that is a matter of cost and resources, not practicality. It absolutely is practical to do if the result is worth the outcome. What is the most famous git based project? Maybe the thing it was originally designed to manage... Think maybe _any_ nation state would be happy to pay less than ~$100k USD (https://sha-mbles.github.io/) to get some malicious code running in production builds of the Linux kernel? The kernel project specifically has extra manual checks and multiple "known good repos" with commits literally being added by hand to protect against this attack. It's practical, it's a problem. It needs to be fixed.

If you still insist on a working example pay me $125k and I'll produce one for you.

If someone can change a committed file inside a Git repository, the main problem is that your system is FUBAR. Let's say I'm the attacker and I'm inside: I can change committed files and I can generate a collision for each. If my goal is to deface the repository, I can insert a file with gibberish. E.g. I have a file with source code:

... omitted ...

ptr=calloc(SIZE, sizeof(long));

... etc ...

then I have :

aDjw'pfojqe'rf[24oijgfpoemgl;m,g02ir-9u13]9fu24[efgje2ioprn

Same sha1 hash.

But wait, why should I waste 1000 GPUs to deface a Git repository when I can simply delete it? I can change the files, I can delete them. It's simply stupid.

An attack that makes sense is to change this:

ptr=calloc(SIZE, sizeof(long));

into:

ptr=calloc(SIZE-10, sizeof(long));

Now I have a buffer overflow (BOF) with the same hash, and only a code review can find the fraudulent change.

This is beyond "I make a collision inserting commented gibberish", like this:

// adojwqf'pjqeworivhneq;lnvl;dqjnfvljeqrvneljvn

You have to insert a change that works and implements an attack, while making it invisible.

Good luck with that. I also read some AI nonsense in some comments, which I find to be Star Trek bullshit.

> If you still insist on a working example pay me $125k and I'll produce one for you

Even with a $100M budget, you can't.

But why would I even want to do that? I have access, I can replace the whole repo with one full of exploitable bugs!

So, the initial question: "If I replace SHA-1 in Git with some newer version, is that a security improvement?" I feel the answer is "NO".

That's what I meant.

I believe there is one more step. You have to somehow get the collision into the repository. Because if you have <hash> in your own repo and pull something from another repo with the same <hash>, the remote changes will not overwrite your blob for <hash> (it will stay the same). Or at least that’s what I seem to remember from something that Torvalds wrote.
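
A minimal Python sketch of the behaviour recalled above, assuming the object store really does keep the first object it saw for a given id. `ObjectStore` is a toy stand-in, not Git's implementation.

```python
import hashlib

class ObjectStore:
    """Toy content-addressed store that never overwrites an existing id."""

    def __init__(self):
        self._objects = {}

    def put(self, oid: str, data: bytes) -> bool:
        """Store data under oid; return False if that id is already present."""
        if oid in self._objects:
            return False  # first writer wins, incoming data is ignored
        self._objects[oid] = data
        return True

    def get(self, oid: str) -> bytes:
        return self._objects[oid]

store = ObjectStore()
oid = hashlib.sha1(b"original").hexdigest()
store.put(oid, b"original")

# A remote offers different bytes under the same id (a pretend collision).
accepted = store.put(oid, b"colliding payload from a remote")
print(accepted, store.get(oid))  # False b'original'
```

Under that assumption a repository that already holds the object keeps it, so the colliding object only reaches stores that have never seen the original.
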
> "I believe there is one more step. You have to somehow get the collision into the repository."

Yes, exactly. So, is it necessary to replace SHA-1 in Git? At the moment, I think there is no reason to, because SHA-1 doesn't expose security vulnerabilities or functional issues.

Brave of you to assume I'm committing valid and sensical code to git

Due to the way hashing works, any change is equivalent to any other one for the purpose of finding a collision.

So you can just alter the formatting to a different convention, alter spacing, add a comment, reorder equivalent lines.

So you can insert a comment and continue altering it until you get a match by varying the line breaking, switching words with synonyms.
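
As a sketch of that loop in Python: vary a comment on both a harmless file and an altered one until their digests meet. The big assumption is that the digest is truncated to 32 bits so the search finishes in a moment; against full 160-bit SHA-1 this naive approach is hopeless, and real attacks rely on cryptanalysis instead. `benign`, `altered`, and `short_digest` are hypothetical names.

```python
import hashlib
from itertools import count

BITS = 32  # toy digest size; real SHA-1 is 160 bits

def short_digest(text: str) -> str:
    """SHA-1 truncated to BITS bits, to keep the demo fast."""
    return hashlib.sha1(text.encode()).hexdigest()[: BITS // 4]

def benign(n: int) -> str:
    return f"total = a + b  # sum, rev {n}\n"

def altered(n: int) -> str:
    return f"total = a - b  # sum, rev {n}\n"

# Table of many harmless variants, keyed by truncated digest.
table = {short_digest(benign(n)): n for n in range(1 << 17)}

# Vary the altered file until it lands on any digest in the table.
for m in count():
    d = short_digest(altered(m))
    if d in table:
        break

a, b = benign(table[d]), altered(m)
print(a != b, short_digest(a) == short_digest(b))  # True True
```

Having freedom on both files is what brings the cost down to roughly the square root of the digest space, compared with matching one fixed target.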

Doesn't it also have to be the same size in bytes?