by uniqueuid 1577 days ago
Yes!

Archiving is important. We have already seen so much online history go down the drain or get saved only by accident.

Large institutions like the Internet Archive are doing an admirable job, but there is a lot of content that they cannot and will not cover. So we will definitely (also) need volunteer-based archival for the foreseeable future.

18TB drives are ~$300 apiece right now, go buy one and help our collective memory!

ArchiveTeam sends archives to the Internet Archive, but the two are not related. I don't think you confused the two, but I mention this every time just in case.

The Warrior is a small Docker image that downloads files via your ISP connection and forwards them to the AT servers. No need for large drives.

For my personal use, I have a home server install of https://github.com/ArchiveBox/ArchiveBox and for that one you may want to get some storage, though I prefer to host its data on the SSD for performance reasons (my archive grows approx. 5000 items or 150GB per year). It's like a private Internet Archive on your home network.

Thanks, it's always good to point that out.

There's a surprising number of tools that are able to submit data to the Internet Archive (and get data from there). Even wget can produce WARC archive files.
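
To make the wget point concrete, here is a rough sketch that builds a wget command line using its `--warc-file` option, which writes the fetched responses to a WARC file alongside the normal download. The helper names `build_wget_warc_command` and `run_capture`, the example URL, and the output name are made up for illustration; actually running it assumes a wget build with WARC support on your PATH.

```python
import shutil
import subprocess


def build_wget_warc_command(url, warc_name, include_page_requisites=True):
    """Return the argv list for a wget run that also writes a WARC file.

    Hypothetical helper for illustration; only the wget flags are real.
    """
    cmd = ["wget", f"--warc-file={warc_name}"]
    if include_page_requisites:
        # Also fetch the images, CSS, etc. needed to render the page.
        cmd.append("--page-requisites")
    cmd.append(url)
    return cmd


def run_capture(url, warc_name):
    """Run wget if it is installed; raises if wget exits with an error."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget not found on PATH")
    subprocess.run(build_wget_warc_command(url, warc_name), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_capture() to actually download.
    print(" ".join(build_wget_warc_command("https://example.com/", "example-capture")))
```

wget appends the WARC extension to the name you give it, so the name is passed without one.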

While the Warrior downloads content via your line (a bit like a residential proxy network), I do think it's important that we decentralize the storage as well.

Just without the crypto mafia/drug traders/investors.

AFAIK you can use IPFS (& clusters[0]) without relying on the crypto parts of that ecosystem. That ought to fit rather well with the use case.

[0] https://cluster.ipfs.io/

Yes there are some really interesting projects, also in the ML replicability space.

One really nice approach is the DAT project [1]. The protocol [2] looks pretty sensible and useful. Unfortunately, the tooling has been in such a state of permanent flux (i.e. perpetual deprecation) that I've never bothered to invest much time.

[1] https://datproject.org/

The last time I tried to do anything with or for the Archive Team, it was a mostly "just watch us work" sort of deal.

The tools couldn't be built without additional knowledge that wasn't published anywhere, because what was published had drifted from what actually worked, and those changes never got folded back in. And there were multiple versions and variants of the tools, with different teams using different versions or variants.

And once you built the tools, you couldn't get your Warrior into the list to be used, although you could always run your systems separately.

It's not like you could sign up for a SETI@Home type initiative and just let your equipment run.

I understand why they work this way. It's a very insular crowd, and new people and resources seem to disappear as quickly as they showed up.

So, they let you watch.

If you stay around long enough (months? years?), then they might let you start participating. But I wasn't willing to wait that long.

I kind of wonder how we can make it searchable again. Is this included in this archiving effort?

In any case wonderful work.

There is a standard set of tooling for indexing archives: CDX files. [1]

They index WARC archives and can be used to quickly find records. You can build on top of this (and some systems do) to make a proper search front-end.
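
As a rough illustration of what such a lookup involves, the sketch below parses space-separated CDX lines and picks the most recent capture for a URL key. The seven-field layout is an assumption (it matches the default output of the Wayback CDX server; real CDX files declare their own field order in a header line), and the sample lines, digests, and function names are invented for the example.

```python
from collections import namedtuple

# Assumed field order; real CDX files state their layout in a header line.
CdxRecord = namedtuple(
    "CdxRecord",
    "urlkey timestamp original mimetype statuscode digest length",
)


def parse_cdx_line(line):
    """Split one space-separated CDX line into a CdxRecord."""
    parts = line.split()
    if len(parts) != len(CdxRecord._fields):
        raise ValueError(f"expected {len(CdxRecord._fields)} fields, got {len(parts)}")
    return CdxRecord(*parts)


def latest_capture(lines, urlkey):
    """Return the most recent record for urlkey, or None if there is none."""
    matches = [r for r in map(parse_cdx_line, lines) if r.urlkey == urlkey]
    # 14-digit timestamps (YYYYMMDDhhmmss) sort correctly as plain strings.
    return max(matches, key=lambda r: r.timestamp, default=None)


# Invented sample data: two captures of one page, one capture of another.
SAMPLE = [
    "com,example)/ 20200101000000 http://example.com/ text/html 200 AAAA 1234",
    "com,example)/ 20210615120000 http://example.com/ text/html 200 BBBB 1300",
    "com,example)/about 20200101000500 http://example.com/about text/html 404 CCCC 512",
]

if __name__ == "__main__":
    print(latest_capture(SAMPLE, "com,example)/"))
```

A linear scan like this is only for illustration; CDX files are normally kept sorted by URL key so a lookup can binary-search instead. Note that the same URL key appears once per capture, so every lookup has to decide which historical version it wants.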

But in general, these archives are NOT geared towards full-blown search because it would be pretty expensive to keep the indexes in hot cache. Plus you would need to deal with historical versions of records, which is not normally done in search UX.

[1] https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem#CD...

Ah, is the WARC format the reason it's called 'Warrior'? It seems like a very strange name for an archival program.
ArchiveTeam seems very guerrilla in their operations.

I always imagined the Warrior as a camo-faced archivist operating under cover of darkness, preserving data even in the most hostile Yahoo-occupied territory.

Thank you for that information!