by no_wizard 2170 days ago
On the topic of size, I wonder how small it would be if you could deduplicate all repositories against each other. I sometimes suspect there is a tremendous amount of copy/pasted code out there being passed off as original work.

Even a naive deduplication might yield some very interesting results.

Reminds me of a time I caught someone using someone else’s code in an interview and passing it off as their own. (Using it was fine; it was the claim that it was theirs that bugged me.)

1 comment

I work at Software Heritage, where we archive all source code we can find, including all GitHub repositories, and deduplicate them internally.

The size of all file contents (including older versions of files) is a few hundred TB, and everything else (directory structures, revision history, etc.) is under 10 TB.

So for GitHub alone it would be a little under that.
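
To make the "naive deduplication" idea concrete, here is a rough sketch of file-level deduplication by content hash. This is only an illustration, not a description of how Software Heritage actually stores things: the `dedup_sizes` function, the choice of SHA-256, and the toy repository data are all assumptions made up for the example.

```python
import hashlib


def dedup_sizes(repos):
    """Return (raw_bytes, deduplicated_bytes).

    `repos` maps a repository name to a {path: file content as bytes} dict.
    Files are treated as identical when their SHA-256 digests match,
    regardless of path or repository (an assumption of this sketch).
    """
    raw = 0
    unique = {}  # content digest -> size in bytes, stored once per distinct content
    for files in repos.values():
        for content in files.values():
            raw += len(content)
            digest = hashlib.sha256(content).hexdigest()
            unique.setdefault(digest, len(content))
    return raw, sum(unique.values())


# Hypothetical toy data: two repositories sharing one identical file.
UTIL = b"def add(a, b):\n    return a + b\n"
MAIN = b"print('hi')\n"
repos = {
    "alice/app": {"util.py": UTIL, "main.py": MAIN},
    "bob/fork": {"helpers/util.py": UTIL},
}

raw, deduped = dedup_sizes(repos)
print(f"raw: {raw} bytes, deduplicated: {deduped} bytes")
```

This only catches byte-for-byte identical files; copy/pasted code that has been reformatted or lightly edited would need some kind of fuzzy matching, which is a much harder problem.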