by atonse 2170 days ago
This is so awesome, but the most surprising part to me is that all the public source code on GitHub totals only 21 TB.

I forget that they do fundamentally host text, and not video etc.

I somehow thought it would be petabytes. The private repos might be more than that but those are historically paid.

3 comments

On the topic of size, I wonder how small it would be if you were able to deduplicate all repositories against each other. I sometimes suspect there is a tremendous amount of copy/paste code out there masquerading as someone else’s.

Even a naive deduplication might yield some very interesting results.
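To make "naive deduplication" concrete, here is a rough sketch that hashes every file in a set of checked-out repos and compares total bytes with unique bytes. It only catches byte-identical files (not lightly edited copies), and the function name `dedup_stats` and the choice of SHA-256 are my own assumptions, not anything GitHub or an archive is known to use.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def dedup_stats(repo_roots):
    """Return (total_bytes, unique_bytes, groups) for whole-file deduplication.

    groups maps a SHA-256 hex digest to every path that has that exact content.
    """
    groups = defaultdict(list)
    total = unique = 0
    for root in repo_roots:
        for path in sorted(Path(root).rglob("*")):
            # Skip directories and Git's own object store.
            if not path.is_file() or ".git" in path.parts:
                continue
            data = path.read_bytes()
            digest = hashlib.sha256(data).hexdigest()
            total += len(data)
            if digest not in groups:
                # First time this exact content is seen: it costs storage.
                unique += len(data)
            groups[digest].append(path)
    return total, unique, dict(groups)
```

Calling `dedup_stats(["repo_a", "repo_b"])` on two local checkouts (hypothetical directory names) gives the bytes before and after deduplication, and any entry in `groups` with more than one path is a file that appears verbatim in several places.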

Reminds me of a time I caught someone using someone else’s code in an interview and passing it off as their own. (Using was fine, it was the claim that it was theirs that bugged me)

I work at Software Heritage, where we archive all source code we can find, including all GitHub repositories, and deduplicate them internally.

The size of all file contents (including older versions of files) is a few hundred TB, and everything else (directory structures, revision history, etc.) is under 10 TB.

So for GitHub alone it would be a little under that.

They've just archived the HEAD of the 6000 most popular repos

> We’ve archived 6,000 of the world’s most popular repositories as a proof of concept for future archives.

> The snapshot will consist of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size.
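The quoted rule does not say how a file is classified as a binary. As a rough sketch only, assuming a common heuristic (a NUL byte near the start of the file means binary) and assuming "100KB" means 100 * 1024 bytes, the filter might look something like this. The names `looks_binary` and `keep_in_snapshot` are made up for illustration and are not the Archive Program's actual code.

```python
from pathlib import Path

# Assumption: the announcement says "100KB" without specifying 1000 vs 1024 bytes.
SIZE_LIMIT = 100 * 1024
# Assumption: only the first few thousand bytes are inspected.
SNIFF_BYTES = 8000


def looks_binary(data: bytes) -> bool:
    """Heuristic: text files rarely contain NUL bytes, most binary formats do."""
    return b"\0" in data[:SNIFF_BYTES]


def keep_in_snapshot(path: Path) -> bool:
    """Keep everything except files that are both large and binary-looking."""
    if path.stat().st_size <= SIZE_LIMIT:
        return True
    with path.open("rb") as f:
        return not looks_binary(f.read(SNIFF_BYTES))
```

Under this particular heuristic a 500KB PNG would be dropped, since PNG headers contain NUL bytes, while a large plain-text file would be kept. The real pipeline may well decide differently.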

Archive Program director here - the 6,000 repos were on the single proof-of-concept reel we archived last autumn. The full archive consists of millions of repos, including all repos with at least one star with any commits in the year leading up to 02/02/2020.

Hello Jon, the info page mentions binaries larger than 100KB are not archived. What about images of 500KB? I really am curious what these archived tar.xz files look like. It would have been nice if the project site included an example of what retrieved data will look like. A lot of readme.md files have illustrations. Either way it's a cool project and I like what your team did.

Where can we find the full list of the 6,000 archived repos?

20 of those are probably node_modules folders.

node_modules wouldn't make it into a Git repo, at least not in the top 6,000 repos on GitHub. That's for sure.