by avar 2605 days ago
I contribute to git, and I read the mailing list, where people a lot smarter than me comment about this sort of thing.

I'm also on a team that runs an in-house enterprise GitLab instance for an S&P 100, so I have experience with it in that configuration, which I understand isn't different from what gitlab.com uses in this regard.

None of this is secret or some sort of insider knowledge. If you know how "git gc" works you can trivially observe most of the behavior of these hosting sites from the outside.

E.g. try pushing a commit and then view it at git{hub,lab}.com/YOU/PROJECT/commit/SHA-1. Then "push --delete" the branch that references it.

You'll find that you can still view it on both sites, even though a fresh clone of the repository won't contain that SHA-1. This is because it's expensive to do a reachability check before serving up the content, and the web frontends access the object store directly.
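To make the distinction concrete, here is a toy model in Python. It is only a sketch: the dictionaries and the names `lookup` and `reachable` are made up for illustration and are not how git or either hosting site is implemented. It shows why a direct object-store lookup still succeeds after the branch is deleted, while anything that walks from the refs no longer sees the commit.

```python
# Toy model only: one dict stands in for the object store, another for refs.
# None of these names come from git, GitHub, or GitLab.

objects = {
    "c1": {"parents": []},
    "c2": {"parents": ["c1"]},  # tip of main
    "c3": {"parents": ["c1"]},  # tip of topic
}
refs = {"main": "c2", "topic": "c3"}


def lookup(object_id):
    """Direct object-store access: a single lookup, no walk from the refs."""
    return objects.get(object_id)


def reachable(refs, objects):
    """Walk history from every ref tip; the cost grows with history size."""
    seen = set()
    stack = list(refs.values())
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(objects[oid]["parents"])
    return seen


del refs["topic"]  # stands in for "push --delete topic"

still_viewable = lookup("c3") is not None          # what a web view could serve
in_fresh_clone = "c3" in reachable(refs, objects)  # what a clone would receive
print(still_viewable, in_fresh_clone)  # True False
```

The object only disappears from the store once something like "gc" prunes it, which is a separate step from deleting the ref.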

Then if you e.g. keep making pushes sufficient to trigger a "gc --auto", and it's been longer than the relevant git "gc.*Expire" time(s) (e.g. gc.pruneExpire), you can deduce whether or not the site uses something close to git's default "gc" semantics. If you do this on GitHub.com you'll find you can access the data for longer than that, possibly "forever".
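The deduction can be written down as a small check. This is a rough sketch under stated assumptions: it only considers gc.pruneExpire (whose documented default is "2.weeks.ago"), it ignores reflogs and anything else that might keep an object alive, and the function names are invented for this example.

```python
from datetime import timedelta

# Git's documented default for gc.pruneExpire is "2.weeks.ago".
DEFAULT_PRUNE_EXPIRE = timedelta(weeks=2)


def expected_gone(unreachable_for, gc_has_run, prune_expire=DEFAULT_PRUNE_EXPIRE):
    """True if default-style gc should already have pruned the object.

    Simplification: ignores reflogs and other things that keep objects
    alive, and treats the object's age as the time since it became
    unreachable.
    """
    return gc_has_run and unreachable_for > prune_expire


def looks_like_default_gc(still_viewable, unreachable_for, gc_has_run):
    """Compare an outside observation with the default expectation.

    Inside the grace period a True result is consistent with default
    semantics but does not prove them.
    """
    return still_viewable != expected_gone(unreachable_for, gc_has_run)


# Still viewable 60 days after the branch was deleted, with gc triggered:
print(looks_like_default_gc(True, timedelta(days=60), True))  # False
```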

This is actually relevant to data recovery in this case. If those impacted by this security incident have lost their data, but have some of the SHA-1s involved (e.g. because they were pasted in IRC), they might find they can still view that content on gitlab.com by browsing it in the commit/tree/blob view, and painfully recover it that way. They won't be able to clone it since neither site turns on uploadpack.allowAnySHA1InWant=true.
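As a starting point for that kind of recovery, something like the following could pull full SHA-1s out of pasted chat logs and turn them into commit-view URLs to try by hand. It is a sketch, not a tested recovery tool: the "https://" scheme, the function name, and the sample log are assumptions, the URL path simply follows the git{hub,lab}.com/YOU/PROJECT/commit/SHA-1 pattern above, and it only matches full 40-character SHA-1s, not abbreviated ones.

```python
import re

# Full-length SHA-1s only: exactly 40 hex digits, not part of a longer word.
SHA1_RE = re.compile(r"\b[0-9a-f]{40}\b")


def commit_urls(log_text, host, owner, project):
    """Return one commit-view URL per distinct SHA-1 found in log_text."""
    seen = []
    for sha in SHA1_RE.findall(log_text.lower()):
        if sha not in seen:  # keep first-seen order, drop duplicates
            seen.append(sha)
    return [f"https://{host}/{owner}/{project}/commit/{sha}" for sha in seen]


# Hypothetical chat log excerpt containing the same SHA-1 twice.
sample_log = """
<alice> pushed 0123456789abcdef0123456789abcdef01234567 to topic
<bob> thanks, reviewing 0123456789ABCDEF0123456789ABCDEF01234567 now
"""
for url in commit_urls(sample_log, "gitlab.com", "YOU", "PROJECT"):
    print(url)
```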

1 comment

I know some git internals, including gc, expiration, reflog etc., but your description was still very interesting. Thanks for taking the time to write this!