Hacker News
by cmrx64 2170 days ago
it's probably less effort to just archive the whole damn thing and let the future figure it out than to decide which things are important enough to archive and leave everything else to disappear someday
2 comments

Archive Program director here. One of the most interesting things our advisory committee told us is that it's really hard to determine what's important in advance: history is replete with lists composed by wealthy people of the books they thought most important, carefully preserved for posterity, whereas what modern historians _really_ want is ordinary people's shopping lists, of which almost none survived. That's one reason we cast a wide net and archived millions of repos instead of, e.g., just the most-starred 100K. Even seemingly trivial repos might collectively be the modern technological equivalent of Renaissance shopping lists, for the historians who may take a particular interest in this (possibly) especially wacky and volatile era.
thank you so much for doing this work btw, archival is one of my loves :)
I wonder how much space you'd save if you excluded repos with only 1 star or only 1 commit.
They’ve excluded pretty much everything below a hundred stars, from what I see.
The inclusion criteria[0] were:

> The snapshot will include every repo with any commits between the announcement at GitHub Universe on November 13th and 02/02/2020, every repo with at least 1 star and any commits from the year before the snapshot (02/03/2019 - 02/02/2020), and every repo with at least 250 stars.

[0] https://archiveprogram.github.com
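
For anyone who wants to check a particular repo against those criteria, here is a rough sketch of the rule as a predicate. It is my reading of the quoted text, not the Archive Program's actual selection code. The `Repo` type and `in_snapshot` function are made up for illustration, the dates are read as MM/DD/YYYY, and the announcement year is assumed to be 2019 since the quote only says "November 13th".

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Dates taken from the quoted criteria (read as MM/DD/YYYY).
ANNOUNCEMENT = date(2019, 11, 13)  # assumption: the quote does not state the year
YEAR_BEFORE = date(2019, 2, 3)
SNAPSHOT = date(2020, 2, 2)


@dataclass
class Repo:
    """Hypothetical minimal view of a repository at snapshot time."""
    stars: int
    # Most recent commit on or before the snapshot date; None if there are none.
    last_commit: Optional[date]


def in_snapshot(repo: Repo) -> bool:
    """Return True if the repo matches any of the three quoted criteria."""
    # Criterion 3: at least 250 stars, regardless of activity.
    if repo.stars >= 250:
        return True
    if repo.last_commit is None:
        return False
    # Criterion 1: any commit between the announcement and the snapshot.
    if ANNOUNCEMENT <= repo.last_commit <= SNAPSHOT:
        return True
    # Criterion 2: at least 1 star and any commit in the year before the snapshot.
    return repo.stars >= 1 and YEAR_BEFORE <= repo.last_commit <= SNAPSHOT
```

Checking only the latest commit is enough here because all three windows end at the snapshot date: if the latest commit is older than a window's start, no commit falls inside that window. Under this reading, a 100-star repo whose last commit was in 2018 is left out, while a zero-star repo with a commit in December 2019 is included.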