Hey folks. I'm the product manager for Git at GitHub. We're sorry for the breakage, we're reverting the change, and we'll communicate better about such changes in the future (including timelines).
I want to encourage you to think about locking in the current archive details, at least for archives that have already been served. Verifying that downloaded archives have the expected checksum is a critical best practice for software supply chain security. Training people to ignore checksum changes is training them to ignore attacks.
GitHub is a strong leader in other parts of supply chain security, and it can lead here too. Once GitHub has served an archive with a given checksum, it should guarantee that the archive has that checksum forever.
I've just had a thought. When GitHub does update the hashing for better compression, everyone relying on the tar hash will update their hashes. This is the ultimate opportunity to change the tar contents, affect the supply chain, introduce vulnerabilities, and have everyone trust you. Something like Nix which computes the NAR Hash (the result of the tar contents) will not be affected by this, since it only cares about the content. I think this is much better than worrying about an unlikely tar vulnerability. In a system that only trusts the tar hashes, the original source is not able to take advantage of better compression over time, without massive risk of supply chain attack. If you think you can hand me a tarball that can run arbitrary code, for any version of tar that has ever existed, please give it to me so I can experiment with exploits, and I'll buy you a drink of your choice at FOSDEM if you're there!
You're not wrong, but you're also not being realistic.
Nix is not the only system that takes this approach. The Go modules "directory hash" is roughly equivalent, although we defined it in terms of somewhat more standard tooling: it is the output of
sha256sum $(find . -type f | sort) | sha256sum
I am not here advocating that everyone switch to this basic directory hash either, because it's not a solution to the more general problem that many systems are solving, namely validating _any_ downloaded file, not just file archives.
There are widespread, standard tools to run a SHA256 over a downloaded file, and those tools work on _any_ downloaded file. Essentially every programming language ships with or has easily accessible libraries to do the same. In contrast, there are not widespread, standard tools or libraries for the "NAR Hash" nor the Go "directory hash". Even if there were, such tools would need to be able to parse every kind of file that people might be downloading as part of a build, not just tar files.
It's a good solution in limited cases such as Nix and Go modules, but it's not the right end-to-end solution for all cases.
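To make the comparison concrete, here is a minimal Python sketch of the two approaches: a whole-file SHA-256 (what `sha256sum file` gives you) and a content-only directory hash modelled on the `sha256sum $(find . -type f | sort) | sha256sum` pipeline quoted above. The function names are made up for illustration, the sort is plain code-point order (roughly what `LC_ALL=C sort` would do), and unusual filenames or symlinks may be treated differently than by the shell tools, so treat it as an approximation rather than the Go or Nix implementation.

```python
import hashlib
import os


def file_sha256(path):
    """Hex SHA-256 of a file's raw bytes, as a whole-file checksum."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large downloads do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def directory_hash(root):
    """Hash the sorted list of per-file hashes under root (content only)."""
    relative_paths = []
    for current_dir, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(current_dir, filename)
            relative = os.path.relpath(full_path, root).replace(os.sep, "/")
            relative_paths.append("./" + relative)

    summary = hashlib.sha256()
    for relative in sorted(relative_paths):
        # Same "<hex>  <path>" line shape that sha256sum prints.
        line = f"{file_sha256(os.path.join(root, relative))}  {relative}\n"
        summary.update(line.encode("utf-8"))
    return summary.hexdigest()
```

The directory hash only sees extracted file names and contents, so it is unaffected by how the archive was compressed. The whole-file hash works on any download, but changes whenever a single byte of the archive does.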
When you say it is not the right end-to-end solution for all cases, I am wondering what case you have in mind that a NAR Hash would not be suitable for.
If you adopt Nix fully, the signed .narinfo file that cache.nixos.org (a Nix substituter) serves contains both the NAR Hash and the hash of the NAR archive file. Additionally, NAR packs and unpacks deterministically, and you can read the implementation in the Nix thesis.
I would also appreciate stronger advertising of the ability to turn a Git tag into a GitHub release and upload stable source code files to it. Maybe even a button in the GitHub releases interface to “generate source tarball and attach as stable tarball to this release.”
I agree this would be great. However, it should also stop you from providing useless tarballs (as `/archive/` does today) if:
- you use autoconf (or any other tool(s) that require generating code into the source archive); or
- you have submodules (to which `git archive` is completely blind).
Note that `git-archive-all`[1] can help as long as your submodules don't do things like `[attr]custom-attr` in their `.gitattributes` as it is only allowed in the top-level `.gitattributes` file and cannot be added to the tree otherwise.
We updated our Git version which made this change for the reasons explained. At the time we didn't foresee the impact. We're quickly rolling back the change now, as it's clear we need to look at this more closely to see if we can make the changes in a less disruptive way. Thanks for letting us know.
Consumers often mistake "hasn’t changed" for a commitment to never change: any sufficiently large product will be littered with these kinds of implicit commitments made by the product to consumers that nobody has visibility into. You’re unfortunate that we were all relying on this commitment you’ve never made, but the quick reversion is the best we can hope for. People will theorise how this could have been avoided but c’est la vie: an easy mistake that you’ve responded well to.
With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.
At this point they'll be stuck on old git for all of eternity unless they just roll their own archive/compress step out of band so the old hashes still work. Yikes.
We are seeing an npm install failure inside our docker builds pointing at a github URL with a SHA change. Is this possibly related?
#15 [dev-builder 4/7] RUN --mount=type=secret,id=npm,dst=/root/.npmrc npm ci
#0 4.743 npm WARN deprecated querystring@0.2.0: The querystring API is considered Legacy. new code should use the URLSearchParams API instead.
#0 8.119 npm WARN tarball tarball data for http2@https://github.com/node-apn/node-http2/archive/apn-2.1.4.tar.gz (sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ==) seems to be corrupted. Trying again.
#0 8.164 npm ERR! code EINTEGRITY
#0 8.169 npm ERR! sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== integrity checksum failed when using sha512: wanted sha512-ad4u4I88X9AcUgxCRW3RLnbh7xHWQ1f5HbrXa7gEy2x4Xgq+rq+auGx5I+nUDE2YYuqteGIlbxrwQXkIaYTfnQ== but got sha512-GWBlkDNYgpkQElS+zGyIe1CN/XJxdEFuguLHOEGLZOIoDiH4cC9chggBwZsPK/Ls9nPikTzMuRDWfLzoGlKiRw==. (72989 bytes)
#0 8.176
#0 8.177 npm ERR! A complete log of this run can be found in:
#0 8.177 npm ERR! /root/.npm/_logs/2023-01-30T23_19_36_986Z-debug-0.log
#15 ERROR: process "/bin/sh -c npm ci" did not complete successfully: exit code: 1
This was working earlier today and the docker build/package.json haven't changed.
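For anyone puzzled by the `sha512-...` strings in that log: as far as I know they are Subresource Integrity style values, i.e. the base64 encoding of the raw SHA-512 digest of the tarball bytes with a `sha512-` prefix. A small Python sketch of that check follows; the function names are hypothetical and this is not npm's actual code.

```python
import base64
import hashlib


def sri_sha512(data: bytes) -> str:
    """Build a 'sha512-<base64 digest>' integrity string for the given bytes."""
    digest = hashlib.sha512(data).digest()
    return "sha512-" + base64.b64encode(digest).decode("ascii")


def check_integrity(data: bytes, wanted: str) -> bool:
    """Return True when the bytes match the recorded integrity string."""
    return sri_sha512(data) == wanted
```

The lockfile records the `wanted` value at the time the dependency was added, so any change to the served tarball bytes, even with identical extracted contents, makes `check_integrity` return `False` and surfaces as `EINTEGRITY`.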
That's what I thought, but I assumed with the rollback an hour plus ago, it wouldn't still be happening. That was off a build just a few minutes ago (actually repeated it in between the time I posted my original message and this reply and it happened again).
Just want to second this. Still seeing an issue in our build right now that seems related.
```
Building aws-sdk-cpp[core,dynamodb,kinesis,s3]:x64-linux...
-- Downloading https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... -> aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz...
[DEBUG] To include the environment variables in debug output, pass --debug-env
[DEBUG] Feature flag 'binarycaching' unset
[DEBUG] Feature flag 'manifests' = off
[DEBUG] Feature flag 'compilertracking' unset
[DEBUG] Feature flag 'registries' unset
[DEBUG] Feature flag 'versions' unset
[DEBUG] 5612: popen( curl --fail -L https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... --create-dirs --output /home/*redacted*/vcpkg/downloads/aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz.5612.part 2>&1)
[DEBUG] 5612: cmd_execute_and_stream_data() returned 0 after 12643779 us
Error: Failed to download from mirror set:
File does not have the expected hash:
url : [ https://github.com/aws/aws-sdk-cpp/archive/a72b841c91bd421fb... ]
File path : [ /home/*redacted*/vcpkg/downloads/aws-aws-sdk-cpp-a72b841c91bd421fbb6deb516400b51c06bc596c.tar.gz.5612.part ]
Expected hash : [ 9b7fa80ee155fa3c15e3e86c30b75c6019dc1672df711c4f656133fe005f104e4a30f5a99f1c0a0c6dab42007b5695169cd312bd0938b272c4c7b05765ce3421 ]
Actual hash : [ 503d49a8dc04f9fb147c0786af3c7df8b71dd3f54b8712569500071ee24c720a47196f4d908d316527dd74901cb2f92f6c0893cd6b32aaf99712b27ae8a56fb2 ]
```
In my particular use-case, I'm using a set of local dev tools hosted as a homebrew tap.
The build looks up the github tar.gz release for each tag and commits the sha256sum of that file to the formula.
What's odd is that all the _historical_ tags have broken release shasums. Does this mean the entire set of zip/tar.gz archives has been rebuilt? That could be a problem, as perhaps you cannot easily back out of this change...
They never really stored them, they were always generated by some code (maybe with a cache layer in front). The code changed in a way that changed the bytes in the tar.gz without affecting their contents-when-extracted.
The trick here is that a Github release is in essence simply a tag of a specific commit. There is no need to build archives in advance, as they can be dynamically generated from the git repo.
However, if you change the compression algorithm used to generate the archive, it'll result in a different checksum! The content is the same, but the archive is not.
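A rough way to see this for yourself, sketched in Python: build one tar in memory, gzip it at two different compression levels, and compare checksums. The file name and contents are invented for the demo, and varying the gzip level merely stands in for "the compression code changed"; it is not a claim about what GitHub's change actually did. It assumes Python 3.8+ for the `mtime` argument to `gzip.compress`.

```python
import gzip
import hashlib
import io
import tarfile


def build_tar(files):
    """Build an uncompressed tar in memory with fixed metadata."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name in sorted(files):
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = 0  # fixed timestamp so the tar bytes are reproducible
            archive.addfile(info, io.BytesIO(files[name]))
    return buffer.getvalue()


def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()


# Hypothetical repository contents, just enough text to compress.
files = {
    "project/README": "".join(f"line {i}: {i * i}\n" for i in range(2000)).encode()
}
tar_bytes = build_tar(files)

# Same tar, two compression settings (mtime=0 keeps the gzip header stable).
fast = gzip.compress(tar_bytes, compresslevel=1, mtime=0)
best = gzip.compress(tar_bytes, compresslevel=9, mtime=0)

print("archive checksums equal:", sha256_hex(fast) == sha256_hex(best))
print(
    "extracted tar checksums equal:",
    sha256_hex(gzip.decompress(fast)) == sha256_hex(gzip.decompress(best)),
)
```

The two `.tar.gz` byte streams should hash differently while the decompressed tar is byte-for-byte the same, which is the situation described above.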
Pretty bizarre this ever was stable in the first place.
Unfortunately for this kind of service you need to actively fiddle with the bytes to prevent people from relying on an implementation detail like this and prevent them from digging you into a too big to fail api stability hole.
That's my thought as well. They could also potentially retroactively generate the source tarballs using the old method for every possible repository/tag on GitHub, store them, and serve those, and then only generate on demand for new tags, but I doubt they'll do that. They might though, given this is what led to the problem in the first place (i.e., the on-demand generation vs generating on push and storing).
That seems wasteful. Many projects do not actively advertise the GitHub tag downloads, and instead have their own stored and stable tarballs (or other distributions). And I suppose many users of those auto-generated downloads don’t care about their checksums.