Hacker News
The GitHub Arctic Code Vault (github.blog)
159 points by gingernaut 2170 days ago
26 comments

As David Rosenthal (formerly of Sun, NVIDIA, and Stanford) explains, the actual Arctic Code Vault is a PR stunt, and has almost no chance of helping anyone in any kind of realistic disaster scenario: https://blog.dshr.org/2019/11/seeds-or-code.html

That said, the rest of the project, which focuses on preserving several independent copies of repositories hosted on GitHub with a handful of partner organizations, is quite useful. From the same post: "They are using a range of technologies, making feeds available over the Internet, and partnering with the Internet Archive, the Software Heritage Foundation and the Bodleian Library. These are mostly things which will get used in the foreseeable future, and should be applauded for that reason."

>> They drag the 200 platters out into the 24hr sunshine, plug the solar panel into the Raspberry Pi, point its camera through a magnifying glass at the first frame, and let the QR app they happen to have on the Pi's micro-SD card do its thing. A couple of seconds later they have the first 2,900 bytes on the USB drive. It takes another couple of seconds to move to the next frame by hand. So they sit there for 383 days scanning a frame every 4 seconds to decode the entire archive. Except there's only sunshine enough for the Pi half the year, so it takes rather more than two years. Then they need to start the Pi building all that code...

>> Of course, this is ridiculous. No-one will decode this archive in the foreseeable future.
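
For what it's worth, the figures in that quote are easy to sanity-check. A rough sketch, taking the quoted 2,900 bytes per frame, 4 seconds per frame and 383 days at face value (the constant and function names are mine, not from the post):

```python
# Back-of-the-envelope check of the quoted scanning estimate.
# All three inputs are copied from the quote; nothing here is measured.
BYTES_PER_FRAME = 2_900
SECONDS_PER_FRAME = 4
SCAN_DAYS = 383

SECONDS_PER_DAY = 24 * 60 * 60


def frames_scanned(days: int, seconds_per_frame: int) -> int:
    """Number of frames read when scanning non-stop for `days` days."""
    return days * SECONDS_PER_DAY // seconds_per_frame


def bytes_recovered(days: int, seconds_per_frame: int, bytes_per_frame: int) -> int:
    """Total bytes decoded over the same period."""
    return frames_scanned(days, seconds_per_frame) * bytes_per_frame


if __name__ == "__main__":
    frames = frames_scanned(SCAN_DAYS, SECONDS_PER_FRAME)
    total = bytes_recovered(SCAN_DAYS, SECONDS_PER_FRAME, BYTES_PER_FRAME)
    print(f"{frames:,} frames, about {total / 1e9:.1f} GB")
```

Under those numbers, 383 days of non-stop scanning works out to roughly 8.3 million frames, or about 24 GB of decoded data.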

Yes, no one will be digging code out of GitHub right after the apocalypse. But what about 200 years after the apocalypse? Or maybe just 1,000 years from now, no apocalypse needed? I could see the archive being of immense historical value.

> "But what about 200 years after the apocalypse? Or maybe just 1,000 years from now, no apocalypse needed?"

-Thanks to flash memory cell charge leakage, I'd be surprised if the micro-SD card or USB drive kept its data for more than 3-5 years. They're designed for low cost, not longevity.

-The electrolytic caps will probably have dried out and failed by 50-100 years.

-The plasticizers used will have evaporated away by a century, leaving any plastic or rubber components brittle and crumbly.

-The lead free solders used in modern electronics are prone to the "tin whiskers" phenomenon. Not sure about the mitigations or timeframe for growth but a couple centuries is far, far longer than any reasonable design timeframe, making it a distinct possibility in my mind.

-At 1000 years, I'd wonder about diffusion effects in chips wrecking the circuits. It would be interesting to do a calculation to see how long that would take for an unpowered chip at room temperature.

Right, so it wouldn't be a 2020 computer. It would be whatever new computer they've built.
By then, why would they need code from GitHub? Given that they won't even be able to run it in any shape or form.
To study history and culture. And who knows, there may well be algorithms we came up with but which no one ever re-discovered.
The Pi isn't going to work after 200 years. Its flash will be wiped. Never mind aging on all the other parts.
Presumptively, the Tech Tree will have some way of bootstrapping a system capable of decoding the tapes. They say in the introduction that it’s nearly useless to access the tapes without a computer and that they expect whoever is reading this to have a computer that is centuries more advanced than we have now.

Maybe they just zip tied a ThinkPad to the tape reader and pray that it can eat whatever happens to it in the vault.

Archive Program director here - it's really not a PR stunt, we genuinely believe it will be of significant historical value and that there's quite a good chance it will be of practical value.

Much of that is "if we forget technology which we realize somewhere down the road we actually might want to use again." History provides plenty of examples of this, and it's particularly important with a technology which mostly lives on ephemeral media that only lasts a few decades.

Even if you do expand your speculation to post-disaster scenarios, though, while it's true the archive wouldn't be an instant reset button, it would help greatly accelerate the recovery of technology. It's worth noting that it will come with a slew of (human-readable, not encoded) technical works regarding subjects ranging from modern software engineering to microprocessor design to photolithography to power systems, which we call the Tech Tree, along with a guide and index to all the stored repos. Wherever its inheritors / discoverers may be in terms of technological advancement, and especially if they have modern-ish hardware (which can last much, much longer than most storage media), recovering the archive's contents will be a lot faster than rediscovering them from scratch.

(Also worth noting we'll be storing "greatest hits" copies of the ~15,000 most-starred / most-relied-on repos, along with a sampling of several thousand repos with few/no stars, in a selection of places like Oxford's Bodleian Library; our hypothetical future tech seekers won't have to go all the way to Svalbard for those.)

I don't want to stress the doomsday scenarios too much, though, despite our ongoing pandemic. I think the most likely outcome by far is that progress will continue; the archive may be useful to recover a couple of otherwise forgotten technologies that suddenly become important / interesting; and it will ultimately be chiefly of interest to historians. That historical value is a key reason why it casts such a broad net. I too have a couple of fairly unsophisticated pet projects in there that the future won't be interested in individually - but collectively is another matter. One of the most interesting things our advisory committee told us is that history is replete with lists composed by wealthy people of the books they thought most important, carefully preserved for posterity, whereas what modern historians _really_ want is ordinary people's shopping lists, of which almost none survived. That's one reason there are millions of repos in the Arctic now, instead of eg just the most-starred 100K: some of those may be the modern technological equivalent of Renaissance shopping lists, for the historians who may take a particular interest in this (possibly) especially wacky and volatile era.

I know it's an inherently cinematic and dramatic project and so it's tempting to call it a PR stunt ... but I assure you, it's not, and, speaking personally, I would never have gotten involved with it if I thought it was.

People have some legitimate and some less legitimate criticisms here, in the HN comments section of course, but I for one think this is a fantastic effort and I'm pleasantly surprised to read what the new badge I saw on my profile yesterday is actually about.

There will always be "negative Nancies" -- especially here, they are everywhere -- but personally I'd just like to say thanks for having some vision outside of the normal day-to-day of making money for shareholders and keeping regular customers happy. More of this, please.

Did people with repositories know this was going to happen and did you give them a choice to opt out?
Rather more eloquently asked than by the other person I saw querying this[0]! I suspect it's covered under GitHub's ToS - specifically[1], only public repositories were included and these are all effectively just backups. Especially in the case of the vault in Svalbard. But you can opt out of the 'warm storage'[0].

[0] https://github.com/github/archive-program/issues/36 [1] https://docs.github.com/en/github/site-policy/github-terms-o...

I recognize they wouldn't have done it unless they felt confident of having the legal right, but it's just bad manners not to ask first.

If that's the case, this not-a-PR stunt degraded my impression of them.

I'm quite certain this isn't what their customers contemplated when reading "backup" in their ToS.

EDIT: Interestingly it says "This license does not grant GitHub the right to sell Your Content or otherwise distribute or use it outside of our provision of the Service."

It also says "You still have control over your content".

Is a subarctic vault really within the ordinary course of providing the service? Did content owners have an opportunity to exert any control?

Most probably think it's neat, but GitHub would be naive to imagine everyone would consent.

Also what happens if it turns out one of those repos had personal information in it and the subject makes a GDPR right-to-forget demand? Are they going to drag it out and purge that bit of tape?

>Also what happens if it turns out one of those repos had personal information in it and the subject makes a GDPR right-to-forget demand? Are they going to drag it out and purge that bit of tape?

I believe GDPR has exemptions for archives ([0] section 28) so that's less of a concern for them I imagine. I recognise what you're saying, but I think anyone _very_ opposed would have a difficult time in court arguing GitHub should remove their work/name/etc. My (very loose) understanding of the law is that they would have to demonstrate some kind of loss. That being said, GitHub could just have sent a notification email with very little effort. Maybe 'no harm, no foul' applies here?

[0] https://www.legislation.gov.uk/ukpga/2018/12/schedule/2/part...

Hi Jon, Congratulations on moving forward with this. Thank you! If you ever think about what might come next in terms of being able to re-make computers and so on from scratch, here is a concept website I put up around 1999 (when I was trying to get NASA to support the work for space settlements). I still work on the general idea on-and-off in my spare time (generally at a more abstract level of software for sensemaking and organizing information) but so many other distractions get in the way: https://www.kurtz-fernhout.com/oscomak/goals.htm

From there: "The OSCOMAK project is an attempt to create a core of communities more in control of their technological destiny and its social implications. No single design for a community or technology will please everyone, or even many people. Nor would a single design be likely to survive. So this project endeavors to gather information and to develop tools and processes that all fit together conceptually like Tinkertoys or Legos. The result will be a library of possibilities that individuals in a community can use to achieve any degree of self-sufficiency and self-replication within any size community, from one person to a billion people. Within every community people will interact with these possibilities by using them and extending them to design a community economy and physical layout that suits their needs and ideas. As the internet has grown, it has enabled collaborative work which has created many success stories, including Linux, Python, GCC, Squeak and other projects. We want to harness that power and apply it to organizing technological knowledge in concert with many interested individuals. The main project goal is to develop an on-line library of technology ideas, techniques, and tools, including a range from high-tech processes like plastics to medium-tech like ceramic houses to low-tech like spinning wheels. Also included will be biotechnology processes, like perennial agriculture, companion planting, sheep farming, and eventually cloning and DNA synthesis. One process to be included is a way to convert the high-tech computerized library to a low-tech paper one as desired. Key to the whole endeavor will be to present everything in a how-to fashion. Also needed is a way to map out and simulate the interrelations of processes; for instance, sheep raising requires veterinarians, antibiotics, feed, fencing, and shears; shears require a blacksmith, metal, and a furnace. 
This latter feature also would be used to keep track of the product flows into, out of, and within a community's entire economy."
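
The "map out the interrelations of processes" part is essentially a dependency graph. A minimal sketch of what that lookup might look like, using only the sheep and shears examples from the quote (the `REQUIRES` table and `all_requirements` function are hypothetical names for illustration, not anything from OSCOMAK itself):

```python
# Hypothetical dependency map built from the examples in the quote above.
REQUIRES = {
    "sheep raising": ["veterinarians", "antibiotics", "feed", "fencing", "shears"],
    "shears": ["blacksmith", "metal", "furnace"],
}


def all_requirements(process: str, requires: dict[str, list[str]]) -> set[str]:
    """Return everything `process` needs, directly or indirectly."""
    seen: set[str] = set()
    stack = list(requires.get(process, []))
    while stack:
        item = stack.pop()
        if item in seen:
            continue  # already visited; this also guards against cycles
        seen.add(item)
        # Items without an entry are treated as raw inputs with no further needs.
        stack.extend(requires.get(item, []))
    return seen


if __name__ == "__main__":
    print(sorted(all_requirements("sheep raising", REQUIRES)))
```

Asking for "sheep raising" pulls in the blacksmith, metal and furnace via the shears, which is the kind of transitive lookup the quote describes.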

> Also worth noting we'll be storing "greatest hits" copies of the ~15,000 most-starred / most-relied-on repos, along with a sampling of several thousand repos with few/no stars

Making all of this code essentially useless. You'd need to store those repos and their entire dependency tree.

Honestly, however much this project is either (a) a genuine archeological move for the preservation of information or (b) a way to get good press, all I genuinely thought when this happened was "aw shucks - wish I'd fixed those bugs before they zapped it onto film and flew it to Santa Claus".
I am a web archivist with an archival project on Svalbard that predates this GitHub initiative.

Additionally, large-scale github-specific projects like https://gharchive.org (formerly GitHub Archive) have existed for some time.

In my experience, code is more likely than not to be preserved in a stale revision, if at all.

The most common forms of preservation are (a) simple tarballing and (b) git bundles.
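
For anyone who hasn't done either, here is a minimal sketch of both approaches. It assumes a local clone on disk and the `git` executable on PATH; the function names and paths are placeholders, not part of any archive's actual tooling:

```python
import subprocess
import tarfile
from pathlib import Path


def make_tarball(repo_dir: Path, out_file: Path) -> Path:
    """(a) Pack the working tree, including .git, into a gzip-compressed tarball."""
    with tarfile.open(out_file, "w:gz") as tar:
        tar.add(repo_dir, arcname=repo_dir.name)
    return out_file


def make_bundle(repo_dir: Path, out_file: Path) -> Path:
    """(b) Write every ref and its history to a single git bundle file.

    Requires the `git` executable on PATH.
    """
    # `git -C` changes directory first, so pass an absolute output path.
    subprocess.run(
        ["git", "-C", str(repo_dir), "bundle", "create", str(out_file.resolve()), "--all"],
        check=True,
    )
    return out_file
```

A bundle made this way can later be restored with a plain `git clone` pointed at the bundle file, which is a large part of why it is popular for archiving.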

Beautiful.

Honestly, nothing scares me more than losing all the code and all the technology we've developed in the past 70 or so years. There's been so much advancement, but it's also transferred in such a way (institutional knowledge, proprietary software, proprietary hardware, etc.) that it's super easy to lose. If we preserve open hardware and software, then we could rebuild in the case of civilizational decline and the accompanying knowledge loss, something which we would neither be the first nor the last to experience.

> If we preserve open hardware and software, then we could rebuild in the case of civilizational decline and the accompanying knowledge loss

...can we?

I'm sometimes a little concerned about how complicated chip fabs are. They feel like something that could take generations to rebuild, even if we had all the knowledge on what to do.

Home photo-lithography and chemical etching setups aren't common, but have been done by several people. We wouldn't be able to jump straight to 14nm, but we would probably be able to get to the 500-300nm size relatively quickly (a year or two, maybe, if starting from scratch) and shrink down from there.

Devices would be much bigger and less efficient, but we would be able to run code and pump out 8086 processors within 6 months.

That's just one layer of the stack though. Future archaeologists will also need to create mock npm registries and maven repositories, and set up docker and k8s so they can deploy a complex set of microservices to look up our birthdays.
...all the code to which should be right in the GitHub Vault, right?

Idk, the hardware part seems much more difficult to me.

Thanks for the laugh! I needed that today :)
>Honestly, nothing scares me more than losing all the code and all the technology we've developed in the past 70 or so years

I think we'll be fine (as in, our species will survive). If we lose it all, we can rebuild. We've already proven that we're capable. The code is just a record of our capabilities, not a barrier to entry.

> The Internet Archive is a well-known, widely beloved non-profit digital library which provides free public access to collections of digitized materials. In partnership with the GitHub Archive Program, the Internet Archive (IA) commenced its ongoing archive of GitHub public repositories on April 13 of this year. At present, IA is using a two-pronged approach. First, their well-known Wayback Machine is accessing and archiving raw GitHub data as WARCs, or Web ARChive files. As of this writing they have archived some 55TB of data. Second, they have the goal of making entire archived GitHub repositories available via “git clone,” while also keeping repo comments, issues, and other metadata easily accessible on the web. This second initiative is well underway and initial archiving is expected to commence this month.

Tremendous news.

This means that after the apocalypse people will be able to reclaim the Linux source code but not Windows. I find it poetic that open source may one day be the norm.
I’m thinking that someone at Microsoft may have snuck the code for Windows into the archive after it was pulled from GitHub. Between Windows and OS X, a ton (most?) of the end user software would be unusable to a future generation in its original form since they wouldn’t have the desktop OS it was used on.

Ironically, 500 years from now, they may think that the year of the Linux desktop was 2008 :-D

https://github.com/reactos/reactos would probably make this less of an issue as well.
There are copies of leaked Windows source code floating around... I've even seen it on GitHub but they probably get DMCA'd pretty quickly.
Since Microsoft acquired GitHub, I think they may also put some MS closed-source code in the vault. Separately.
This is so awesome, but the most surprising to me is that all the public source code on GitHub only totals 21 TB.

I forget that they do fundamentally host text, and not video etc.

I somehow thought it would be petabytes. The private repos might be more than that but those are historically paid.

On the topic of size, I wonder how small it would be if you were able to deduplicate all repositories against each other. I sometimes suspect there is a tremendous amount of copy/paste code out there masquerading as someone else’s.

Even a naive deduplication might yield some very interesting results

Reminds me of a time I caught someone using someone else’s code in an interview and passing it off as their own. (Using was fine, it was the claim that it was theirs that bugged me)
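
To make "naive deduplication" concrete, here is a minimal sketch that only catches byte-for-byte identical files, by hashing whole file contents with SHA-256. It assumes the repos are already checked out locally; `dedup_stats` is a made-up name, and this is not how any real archive does it:

```python
import hashlib
from pathlib import Path


def dedup_stats(repo_dirs: list[Path]) -> tuple[int, int]:
    """Return (total_bytes, unique_bytes) across all files in the given trees.

    Files are compared by the SHA-256 of their full contents, so only
    byte-for-byte identical files count as duplicates.
    """
    seen: set[str] = set()
    total = unique = 0
    for root in repo_dirs:
        for path in root.rglob("*"):
            # Skip directories and git's own object store.
            if not path.is_file() or ".git" in path.relative_to(root).parts:
                continue
            data = path.read_bytes()
            total += len(data)
            digest = hashlib.sha256(data).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique += len(data)
    return total, unique
```

The gap between the two numbers is the saving from exact copies only. Copy/paste with small edits would need something fuzzier, such as chunk-level or token-level comparison.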

I work at Software Heritage, where we archive all source code we can find, including all GitHub repositories, and deduplicate them internally.

The size of all file contents (including older versions of files) is a few hundred TB, and everything else (directory structures, revision history, etc.) is under 10TB.

So for GitHub alone it would be a little under that

They've just archived the HEAD of the 6000 most popular repos

> We’ve archived 6,000 of the world’s most popular repositories as a proof of concept for future archives.

> The snapshot will consist of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size.
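
The quoted rule can be read as a simple file filter. A sketch under two loud assumptions that the quote does not settle: "100KB" is taken as 100 KiB, and "binary" is guessed with the common NUL-byte heuristic. How GitHub actually classified files isn't stated, and the function names are hypothetical:

```python
from pathlib import Path

SIZE_LIMIT = 100 * 1024  # assumption: "100KB" read as 100 KiB


def looks_binary(path: Path, probe: int = 8000) -> bool:
    """Heuristic (an assumption): a NUL in the first bytes means binary."""
    with path.open("rb") as fh:
        return b"\0" in fh.read(probe)


def snapshot_files(checkout: Path) -> list[Path]:
    """List files the quoted rule would keep: everything except large binaries."""
    kept = []
    for path in sorted(checkout.rglob("*")):
        rel = path.relative_to(checkout)
        if not path.is_file() or ".git" in rel.parts:
            continue
        # Only files that are both over the limit and binary are dropped.
        if path.stat().st_size > SIZE_LIMIT and looks_binary(path):
            continue
        kept.append(rel)
    return kept
```

Run against a checkout of the default branch, this keeps large text files and small binaries and drops only files that are both binary and over the limit.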

Archive Program director here - the 6,000 repos were on the single proof-of-concept reel we archived last autumn. The full archive consists of millions of repos, including all repos with at least one star with any commits in the year leading up to 02/02/2020.
Hello Jon, the info page mentions binaries larger than 100KB are not archived. What about images of 500KB? I really am curious what these archived tar.xz files look like. Would have been nice if the project site included an example of what retrieved data will look like. A lot of readme.md files have illustrations. Either way it's a cool project and I like what your team did.
Where can we find the full list of the 6000 archived repos?
20 of those are probably node_modules folders
node_modules wouldn't make it into a git repo. At least not in the top 6000 repos on GitHub, that's for sure.
Am I the only one thinking this is a waste of money and time?! How does any of this make sense? Maybe as a weird PR stunt, but ... just strange.
Take a look at the partners named at the bottom. From GitHub's perspective, this is in part a PR stunt. But the Internet Archive, and Software Heritage, are quite serious. The people in those organizations are there for the mission, and if they're in on this project it's because someone there believes it matters to their mission.

You're free not to care about what situation humans might be in in a thousand years or what they might need then. I can't say I spend any effort day to day working with that distant future in mind myself. But I'm glad there are people in our civilization who do.

I agree. I can't believe they are spending so much money and effort to preserve code I don't give a damn about now and didn't when I pushed it to GitHub. The same goes for 99% of the devs I know personally.
it's probably less effort to just archive the whole damn thing and let the future figure it out than to decide important things to archive and leaving everything else to disappear someday
Archive Program director here. One of the most interesting things our advisory committee told us is that it's really hard to determine what's important in advance: history is replete with lists composed by wealthy people of the books they thought most important, carefully preserved for posterity, whereas what modern historians _really_ want is ordinary people's shopping lists, of which almost none survived. That's one reason we cast a wide net and archived millions of repos instead of eg just the most-starred 100K. Even seemingly trivial repos might collectively be the modern technological equivalent of Renaissance shopping lists, for the historians who may take a particular interest in this (possibly) especially wacky and volatile era.
thank you so much for doing this work btw, archival is one of my loves :)
I wonder how much space you'd save if you excluded repos with only 1 star or only 1 commit.
They’ve excluded pretty much everything below a hundred stars, from what I see.
The inclusion criteria[0] were:

> The snapshot will include every repo with any commits between the announcement at GitHub Universe on November 13th and 02/02/2020, every repo with at least 1 star and any commits from the year before the snapshot (02/03/2019 - 02/02/2020), and every repo with at least 250 stars.

[0] https://archiveprogram.github.com
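
Written out as a predicate, the three rules look roughly like this. It's a sketch only: the quote doesn't give the year of the GitHub Universe announcement (2019 is assumed), the slash dates are read as month/day/year, both ends of each range are treated as inclusive, and `is_included` is a made-up name:

```python
from datetime import date

# Dates from the quoted criteria. The year of the GitHub Universe
# announcement is not stated in the quote; 2019 is assumed here.
ANNOUNCEMENT = date(2019, 11, 13)
YEAR_START = date(2019, 2, 3)
SNAPSHOT = date(2020, 2, 2)


def is_included(stars: int, commit_dates: list[date]) -> bool:
    """Apply the three quoted rules; a repo qualifies if any one matches."""
    recent = any(ANNOUNCEMENT <= d <= SNAPSHOT for d in commit_dates)
    past_year = any(YEAR_START <= d <= SNAPSHOT for d in commit_dates)
    return recent or (stars >= 1 and past_year) or stars >= 250
```

So a zero-star repo with a commit in January 2020 qualifies, a zero-star repo last touched in mid-2019 does not, and anything with 250 or more stars is in regardless of activity.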

Have you ever read A Canticle for Leibowitz? It's fiction, but I can see this project being important for future civilization.
well in 25 years when the remains of global civilization are pulling the backups out of the ground, maybe you will think otherwise. /s

kinda with you on that one. looks cool, plausibly useful but we'll see.

Where can we find the list of the 6000 repos ? On my profile it just shows 3 "and more", would like to get the full list. TYIA ;)
This was corrected by @rezendi below:

> Archive Program director here - the 6,000 repos were on the single proof-of-concept reel we archived last autumn. The full archive consists of millions of repos, including all repos with at least one star with any commits in the year leading up to 02/02/2020.

Same. Or how they were picked. I kept scratching my head all evening cause I haven't made any updates or contributions to mine in quite a while.
My best guess is it's some function of the popularity. The three that my profile shows are

- capnproto/capnproto

- sandstorm-io/sandstorm

- erlang/otp

(I don't remember the order).

I actively contribute heavily to sandstorm. I've sent patches here and there to capnproto, and it's vaguely a sister project to sandstorm. Those are probably some of the most popular projects I have multiple contributions to, though there are others.

otp feels a bit odd though, if there's an "and more" -- I sent them a one line patch to fix a build error when building against musl. I haven't really been involved since, nor was I before. But it's a high profile project.

where did you get that 6000 repos number?
I'm really curious about this too. I haven't been able to find this information anywhere.
What I really want from github is to allow people who own open source projects who don't want to own them anymore to just hand them off for escrow so that at a later date a reputable group like apache can maintain them if needed.
If someone wants to give up their project and there isn't anyone in the community who wants to take over, the project is already dead. Open source doesn't work without people around to push it.
Couldn’t Apache just make a fork and announce it? Or is this just about the convenience and marketing?
If it's done this way then all of the web links stay live, and a new owner doesn't need to be found immediately. Think of it as a special permission holding pool. There are many cases of "done" libraries that need changes later. This would help with them. However when they're not done this way you can spend a few weeks / months trying to get ahold of the author and for them to decide "oh yeah I don't really care about x anymore"
I'm curious, do they perform some sort of test reads on the reels to make sure that the data was actually copied over correctly?
What stops this stored data degrading? Do they have to periodically check / renew the reels?
I was hoping for more of a description on how they plan to keep this vault safer than the Global Seed Vault, which was once flooded due to soaring arctic temperatures: https://www.theguardian.com/environment/2017/may/19/arctic-s...
That was sensationalism, per usual. Bit of water in the access tunnel, no seed damage.
The story says right up front in the subhed that the flooding didn't reach the seeds. But the quotes make it pretty clear that what did happen was out of spec.

For something that's meant to survive any catastrophe that might happen over centuries to come, it's not a good sign to see that happen so early. It's extra bad to see it driven by a trend, namely global warming, that we're continuing to push farther and farther and have shown few signs of stopping.

It looks like the code is actually stored in plain text, and that this is basically microfilm?
Archive Program director here. It is basically microfilm (albeit very long-lived) but the data is mostly stored in a pixellated form, not unlike QR codes, although every reel also contains human-readable instructions (and code) re how to unpack its data.
I don't think so. Project Silica talks about storing the data in droplet-looking voxels rather than etching language symbols. Cool video of the process: https://www.youtube.com/watch?v=6CzHsibqpIs
But, from the article, it doesn't look like they used Project Silica here, they used piqlFilm.
Yeah, Project Silica is another project within the GitHub Archive Program. You can see the microfilm in this video

https://www.youtube.com/watch?v=fzI9FNjXQ0o&feature=youtu.be...

Strong A Fire Upon the Deep vibes thinking of future archeologists studying that.
> The next morning, it traveled to the decommissioned coal mine set in the mountain, and then to a chamber deep inside hundreds of meters of permafrost, where the code now resides fulfilling their mission of preserving the world’s open source code for over 1,000 years.

What is the probability that we still have the required tech to read that code in 1,000 years?

Depends on whether the Great Filter is before or behind us.
They __are not__ using QR code for storage as has been misreported by a few media outlets.

See https://earth.esa.int/documents/1656065/3222865/170922-Piql-... for piql's storage method.

Ice. Not to be confused with ICE.
GitHub is working with both? Very chilling.
Only disappointed that the new badge does not show the 2 open source projects I contributed to in the last 10 years of my work for open source :( They are not super big, but also not super small.

Seems organisation work is ignored and only individual username fork/PRs respected (is this a bug?). Software is teamwork ;)

I mean awesome-react, tldr-pages or homebrew-cask are probably not unimportant but that's not where I contributed most to.

I am not a huge GitHub user and have only contributed some code to a single repo that was merged. I was surprised to see I had the badge in my profile.
1000 years from now, I can only imagine the hidden Y3K bugs...
If things come to that I doubt the practicality of it all. But it makes easy headlines. It also makes open source an immortality project for a lot of people.
Ha, that README grammar fix years ago finally pays off!
In 1000 years people will surely benefit from the millions of copy-pasted dotfiles :^)
This is a waste of money.
So, if I'm in the EU can I GDPR my repo out of their vault?
I'm guessing you're being facetious, but it has come up and it's covered in the FAQ: https://archiveprogram.github.com/faq/
"... archives have a special legal status under GDPR which protects them. GitHub’s Legal Team has approved the Archive Program."
This is pointless - a complete waste of time, effort, and energy. Isn’t there something more beneficial they could have done instead? Why pollute the Arctic with plastic and film canisters?