| HN Mirror



	by Cthulhu_ 2144 days ago
	SHA1 is close to being broken, but it's not there yet, and Git will be migrating to a better algorithm. That said, if you could rewrite an older commit, the change would only be applied in a fresh clone, right?


db48x 2144 days ago

Even if you could break SHA1, it's unlikely that your replacement source code would look like it was human-written. Instead, it's going to look like human-written source code containing kilobytes or megabytes of random-looking comments. The comments will only be there to change the hash of the new content back to the hash of the original content. It's not going to be subtle at all.


flingo 2144 days ago

Why would it require that much data? I always thought you wouldn't need to add or change more bytes than are in the output.

Also, git hashes aren't just based on source code. You can add that data anywhere that git uses to generate the hash.


db48x 2144 days ago

That's true of a CRC code, but hashes are a lot harder to break.

Git hashes each file, and puts those hashes into a tree object, like a directory listing. Then it hashes the trees, recursively back up to the root of the repository. Finally the hash of the root tree is put in the commit object, and the commit object is hashed. Thus the two places you can put additional data to be hashed are the file contents (either in existing files or new files), or in the commit message. You can get a few free bits by adjusting less obvious things like the commit timestamp or the author's email address, but not nearly enough to make your forged commit have the same hash as an existing commit.
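For anyone who wants to see that layering concretely, here is a rough Python sketch of how Git derives object IDs. The `<type> <size>\0` header is Git's own object format; the helper names (`git_object_id`, `blob_id`, `tree_id`, `commit_id`), the fixed identity string, and the flat, files-only tree are simplifications made up for illustration.

```python
import hashlib


def git_object_id(kind: bytes, body: bytes) -> str:
    # Git hashes "<type> <size>\0" followed by the object body.
    header = kind + b" " + str(len(body)).encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def blob_id(content: bytes) -> str:
    return git_object_id(b"blob", content)


def tree_id(entries: dict) -> str:
    # entries maps file name -> blob ID (hex). Simplification: regular
    # files only (mode 100644), no subdirectories.
    body = b"".join(
        b"100644 " + name.encode() + b"\x00" + bytes.fromhex(oid)
        for name, oid in sorted(entries.items())
    )
    return git_object_id(b"tree", body)


def commit_id(tree: str, message: str) -> str:
    # Placeholder identity and timestamp; a root commit with no parent.
    ident = "A U Thor <author@example.com> 0 +0000"
    body = f"tree {tree}\nauthor {ident}\ncommitter {ident}\n\n{message}\n"
    return git_object_id(b"commit", body.encode())


original = commit_id(tree_id({"README": blob_id(b"hello\n")}), "Initial commit")
modified = commit_id(tree_id({"README": blob_id(b"hello!\n")}), "Initial commit")
print(original != modified)
```

Changing one byte of `README` changes the blob ID, which changes the tree ID, which changes the commit ID. That is why a forger has to find colliding input somewhere in that chain.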


flingo 2144 days ago

I'm still not following why it'd require so much data. I thought the goal was to have the commit hash collide with an existing commit hash; is that not enough?

I looked around, and it seems like the right place to hide the added data is in the "trailer" section of the commit. It's where Signed-off-by lives, and it is part of the data used to generate the commit hash.

You might want to come up with a plausible reason for random data to go in there though. (likely using a header that wouldn't normally get printed out)
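A quick way to convince yourself that trailers feed into the commit hash is the Python sketch below. It assumes the standard commit object layout; the `commit_id` helper, the identity string, and the `X-Padding` trailer are all hypothetical, and the tree is just Git's well-known empty-tree ID used as a placeholder.

```python
import hashlib


def commit_id(message: str) -> str:
    # Placeholder identity; the tree is Git's empty-tree ID.
    ident = "A U Thor <author@example.com> 0 +0000"
    body = (
        "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
        f"author {ident}\ncommitter {ident}\n\n{message}"
    ).encode()
    # Commit objects are hashed as "commit <size>\0" + body.
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()


base = "Fix typo\n\nSigned-off-by: A U Thor <author@example.com>\n"
# Hypothetical trailer carrying attacker-chosen bytes.
padded = base + "X-Padding: 9f3c0a17\n"

print(commit_id(base))
print(commit_id(padded))
```

The two IDs differ, so anything placed in a trailer is hashed like the rest of the message. Whether a reviewer would notice it is a separate question.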


db48x 2143 days ago

In a CRC-style code, you're essentially adding up all the bytes and letting it overflow the counter, so that the counter is a fixed size (usually 16 or 32 bits). Then you add a few more bytes, exactly the same size as the counter, so that the data bytes plus the extra bytes add up to zero. The extra bytes are delivered along with the data bytes, so that the recipient can repeat the calculation and verify that the total is still zero. If you modify the data, it is trivial to recalculate the CRC code so that the total is still zero.
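As a minimal sketch of that idea, here is a simple additive checksum over 16-bit words in Python. Note the assumption: this is the "add everything up and let it overflow" scheme described above, not a real CRC (real CRCs use polynomial division over GF(2)), though both are easy to recompute after tampering. The function names are made up for illustration.

```python
def word_sum(data: bytes) -> int:
    # Sum the data as big-endian 16-bit words, wrapping at 16 bits.
    if len(data) % 2:
        data += b"\x00"
    words = (int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    return sum(words) & 0xFFFF


def seal(data: bytes) -> bytes:
    # Append one 16-bit word chosen so the total wraps around to zero.
    if len(data) % 2:
        data += b"\x00"
    check = (-word_sum(data)) & 0xFFFF
    return data + check.to_bytes(2, "big")


def verify(packet: bytes) -> bool:
    return word_sum(packet) == 0


packet = seal(b"pay alice 10")
tampered = b"pay alice 99" + packet[-2:]   # data changed, old check word kept
forged = seal(b"pay alice 99")             # check word simply recomputed

print(verify(packet), verify(tampered), verify(forged))
```

The tampered packet fails verification, but the forger only has to rerun `seal` to get a packet that passes. Nothing about the scheme resists a deliberate attacker.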

Hashes are much, much more complex, and they're non-linear. Each bit of the hash output is intended to depend on every single bit of the input, so that changing a single bit in the input creates a radically different hash output.
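The linear versus non-linear difference is easy to check from Python's standard library. The sketch below uses `zlib.crc32` and `hashlib.sha1`; the helper names and sample messages are arbitrary choices for illustration. For equal-length inputs, CRC-32 of `a ^ b ^ c` equals the XOR of the three individual CRCs, while SHA-1 has no such structure, and flipping a single input bit changes roughly half of its 160 output bits.

```python
import hashlib
import zlib


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def bit_diff(a: bytes, b: bytes) -> int:
    # Number of bit positions in which two equal-length byte strings differ.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def sha1(msg: bytes) -> bytes:
    return hashlib.sha1(msg).digest()


a, b, c = b"commit A", b"commit B", b"commit C"  # must be the same length
mixed = xor(xor(a, b), c)

# CRC-32 is affine over XOR for equal-length messages.
crc_is_affine = zlib.crc32(mixed) == zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)

# SHA-1 does not have this property.
sha_is_affine = sha1(mixed) == xor(xor(sha1(a), sha1(b)), sha1(c))

# Avalanche: flip the lowest bit of the first byte and count changed output bits.
msg = b"hello world"
flipped_msg = bytes([msg[0] ^ 1]) + msg[1:]
changed_bits = bit_diff(sha1(msg), sha1(flipped_msg))

print(crc_is_affine, sha_is_affine, changed_bits)
```

That algebraic structure is what makes a CRC trivial to fix up after an edit, and its absence is why forging a SHA-1 match takes a real cryptanalytic attack.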

In a paper published this year, https://eprint.iacr.org/2020/014.pdf, the authors Gaëtan Leurent and Thomas Peyrin changed the values of 825 bytes out of a 1152-byte PGP key in order to generate a new key with the same signature (i.e., the same hash). It only cost about $45k, too.


kibwen 2144 days ago

The git hash surely also takes the contents of binary files into account, so I imagine that in any repo that contains non-text files, an attacker would try to hide the garbage inside e.g. some metadata field of an image file.


db48x 2144 days ago

That's true. PDFs and other document formats are also great because you can include large volumes of data that is never used in the final output.


tomxor 2144 days ago

> That said, if you could rewrite an older commit, the change would only be applied in a fresh clone, right?

I think so, assuming the fetch algorithm uses the hashes to work out which deltas to get, which I think it does.

I'm not sure about CVS, but with Git, rewriting a _previous_ commit _object_ with different blobs, while making the commit object keep the _same_ hash by messing with its comment, wouldn't cause any difference in child commits. Commits are pretty much independent other than the pointer to the parent commit, which is incorporated into the hash (i.e., the child commits would have different trees, so the changes would not propagate to the HEAD of the branch).
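To illustrate the point about pointers, here is a small Python sketch, assuming the standard commit object layout. The `commit_id` helper, the identity string, and the all-ones and all-twos parent IDs are hypothetical. A child commit stores only its parent's ID and its own tree ID, so its hash depends on the parent's hash, not on whatever content sits behind that hash.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # Git's empty-tree ID


def commit_id(tree: str, parent: str, message: str) -> str:
    # Placeholder identity; the parent is referenced purely by its ID.
    ident = "A U Thor <author@example.com> 0 +0000"
    body = (
        f"tree {tree}\nparent {parent}\n"
        f"author {ident}\ncommitter {ident}\n\n{message}\n"
    ).encode()
    return hashlib.sha1(b"commit %d\x00" % len(body) + body).hexdigest()


parent_id = "1" * 40  # hypothetical ID of the commit being forged

child_before = commit_id(EMPTY_TREE, parent_id, "child commit")
# A colliding forgery keeps the same parent ID, so the child is unchanged.
child_after = commit_id(EMPTY_TREE, parent_id, "child commit")
# A non-colliding rewrite changes the parent ID and therefore the child ID.
child_rewritten = commit_id(EMPTY_TREE, "2" * 40, "child commit")

print(child_before == child_after, child_before == child_rewritten)
```

The child also keeps pointing at its own, untouched tree, which is why swapping the parent's contents would not show up at the HEAD of the branch.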

I think the only way to have something end up in the HEAD of a branch AND persist is to break the SHA1 of a blob (i.e., a file) by inserting the extra SHA1-breaking content into the blob itself rather than into the commit (provided that exact blob hash is part of the tree in the HEAD of a branch). Then you would also need to hope that the malicious blob is fetched by the person who writes the next commit based upon the HEAD of that branch AND modifies the same file blob, so that it persists into the next revision of the blob... seems pretty hard to pull off - pun intended

There is also the issue of pushing a blob that already exists on the remote according to the hash. Even with rewrite permission, GC might make that hard to do quickly... I wonder if you would need direct access to the git server to do this.

[EDIT]

Thinking about swapping out SHA1 in the future, you would still want to rehash all of the blobs and trees. Otherwise old blobs that stay unchanged going forward would remain open to SHA1 attacks like the one I described above.

If you only hashed new blobs with the new algorithm you would need to wait until every file had been touched to be safe.


eru 2143 days ago

Yes, I would assume that most git repositories would want to re-hash all old commits when SHA1 gets replaced.

For backwards compatibility, I suspect we'll add the new hash and keep SHA1 around, unless you specifically disable SHA1.
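A rough Python sketch of what "add the new hash and keep SHA1 around" could look like: compute both IDs for every existing object and keep a lookup table from old to new. This assumes the new format hashes the same `<type> <size>\0` header and body with SHA-256, which as far as I know is how Git's SHA-256 object format works; the `object_ids` helper and the mapping table are made up for illustration and are not Git's actual migration mechanism.

```python
import hashlib


def object_ids(kind: bytes, body: bytes) -> dict:
    # Same "<type> <size>\0" + body input, hashed with both algorithms.
    data = kind + b" " + str(len(body)).encode() + b"\x00" + body
    return {
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Hypothetical set of existing blobs that have not been touched in years.
old_blobs = [b"hello\n", b""]

# Table from legacy SHA-1 ID to new SHA-256 ID, so old references still resolve.
sha1_to_sha256 = {}
for content in old_blobs:
    ids = object_ids(b"blob", content)
    sha1_to_sha256[ids["sha1"]] = ids["sha256"]

for old, new in sha1_to_sha256.items():
    print(old, "->", new)
```

Rehashing everything up front means an untouched old blob gets a strong ID right away, instead of waiting until someone happens to edit that file.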
