Hacker News new | ask | show | jobs
by bk2204 3386 days ago
I'm the person who's been working on this conversion for some time. This series of commits is actually the sixth, and there will be several more coming. (I just posted the seventh to the list, and I have two more mostly complete.)

The current transition plan is being discussed here: https://public-inbox.org/git/CA+dhYEViN4-boZLN+5QJyE7RtX+q6a...

5 comments

I do like your hashname/nohash idea. If we could come up with a simple compression negotiation protocol also: zlib -> zstd. But this will be much harder, as hashes are internal only, and compression is in the protocol.

kudos to brian m carlson to convince linus to use sha3-256 over sha256. this is really the only sane option we have.

I don't understand what you mean by "hashes are internal only"? Aren't the sha1's everywhere right now. I mean not only in the protocol but also part of the UI and from there they even spread into bug trackers, documentation and so forth.
> this is really the only sane option we have

Why?

Yeah, I would have gone with BLAKE2. It's much faster than SHA-256 and SHA3-256: https://blake2.net/skylake.png
This is a perfect example of a situation where hashing performance doesn't matter at all.
I'm not familiar with Git internals. Does the performance of the hashing algorithm contribute significantly to how Git deals with large files or with operations over a large number of small files?

I've run into performance problems with things like MathJaX, which includes thousands (or tens of thousands?) of files as a backup method for rendering equations. (I understand each file has a single character in some typeface.)

The hash function may not matter for overall git performance in virtually all dev machine setups, but there will be a (maybe tiny, maybe larger, depending on the repo and disk io speed) difference in cpu utilization and heat generation, right?
That's a silly thing to worry about when you're developing Ruby or Java applications. My PC boots faster than the Rails console or IntelliJ.
sha3 will probably get hw accel eventually. Blake2 is less likely to. It's like the dilemma between chacha20 and a stream cipher mode for aes. An argument could be made for either, depending on application specifics and available hardware.
But like its ancestor chacha, blake2 is fast on anything that has SIMD.
Did you make any measurements or back of the envelope calculations what the real world performance impact of this change is.

I don't expect anything horrible, but still curious.

EDIT: After skimming OP I found a few answers.

The message from the The Keccak Team [1] is especially interesting. Summary is that we don't have to worry about performance degradation because of the hash calculation itself. There is a palette of functions which are considered to have a "security level [...] appropriate for your application" and are considerably faster than SHA1.

[1] https://public-inbox.org/git/91a34c5b-7844-3db2-cf29-411df5b...

If git changed to BLAKE2b, I'd expect a perf improvement over SHA-1.
Out of curiosity: when did you start to take the first serious steps in this direction?
From the commit history, 2015 (commit 5f7817c85d4b5f65626c8f49249a6c91292b8513).

I proposed the idea of improved compile-time checking and maintainability, as there wasn't originally much interest in a new hash function, but the maintainability improvements were something people could go for.

I hadn't spent as much time working on it as I am now, so it moved slowly. Other people also helped by converting parts of the code that they were working on (like parts of the refs subsystem).

Thanks!
I'm not quite so familiar with the Git internals, how do you deal with the problem of having different non-leaf nodes scattered through the directory tree?

This might be a non-issue based on how Git stores the tree, but I can imagine one simple model where each directory would be a sort of "collection object", a binary encoding of a list of (filename, hash) pairs in filename order, and therefore the directory gets a hash of its own. But that means that when you're communicating with a SHA-1 repository you don't just need to rename this object; its contents also need to be changed pre-rename, and then you need to store every internal node twice. I'm not seeing that in your summary.

Is it just that Git doesn't have any internal nodes in the directory tree per se because the "filename" is a full POSIX path with subdirs? Or what?

https://git-scm.com/book/en/v2/Git-Internals-Git-Objects has descriptions of the objects. Both trees and commits are hashes over data that includes hashes of other objects so they must be different. The doc discusses converting them at transmission time, search for [convert to sha256] in it.
>b. A SHA256 repository can communicate with SHA-1 Git servers and clients (push/fetch).

Wouldn't fetching from a sha-1 repository degrade security? I think it would be better to show a warning (similar to how openssh does with 1024 bit dsa keys) every time you try to fetch from a SHA-1 git repo. Same for pushing a signed commit to a sha-1 repository.

The sha1 hash isn't used for security. You should be signing your commits if security is a concern.
Uh, even a signed commit does still rely on the sha1 hash of the actual tree object and any parent commits. It won't stop something bad from happening if you fetch from a sha1 repo.