Hacker News
by RNAlfons 1574 days ago
Make it an easily installable/runnable Windows application and it will spread like wildfire.

If only it were that easy. To keep the quality of distributed archiving as high as possible, you need environments that are as reproducible as possible, which is why the "official" way of participating is to run a virtual machine instead of running directly on the host.

Not sure why this 3rd party is the submission site rather than the official page, which is this: https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior

Has a couple of different installation methods as well.

Yep, with VirtualBox it is rather easy to get the Warrior running!
Why even require that? If the data in question is available over HTTP, it should be as easy as opening a page from the relevant origin in a browser tab, optionally opening a second tab for a "Warrior Dashboard", then invoking a bookmarklet on the former to slurp up data by XHR &tc. (If it's necessary to cross origins as the thing roves around, the dashboard can alert you to this while it continues doing what it can with the first origin. Just have the human return to the dashboard from time to time and repeat the second step to run as many in parallel as they want.)
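The cross-origin caveat above can be sketched in a few lines. This is only an illustration of the bookkeeping a dashboard would need, written in Python rather than as an actual bookmarklet; the names `origin_of` and `split_queue` are hypothetical and not part of any Archive Team tool. It assumes the usual definition of an origin (scheme, host, port) and simply separates the URLs that the current tab could fetch from those that would need a tab on another origin.

```python
from collections import defaultdict
from typing import Dict, List, Tuple
from urllib.parse import urlsplit

# Default ports, so that http://example.com and http://example.com:80 compare equal.
DEFAULT_PORTS = {"http": 80, "https": 443}

Origin = Tuple[str, str, int]


def origin_of(url: str) -> Origin:
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme, 0)
    return (scheme, host, port)


def split_queue(current_page: str, queue: List[str]) -> Tuple[List[str], Dict[Origin, List[str]]]:
    """Split a work queue into URLs fetchable from the current tab and
    URLs that need a human to open a tab on a different origin."""
    here = origin_of(current_page)
    fetch_now: List[str] = []
    needs_new_tab: Dict[Origin, List[str]] = defaultdict(list)
    for url in queue:
        origin = origin_of(url)
        if origin == here:
            fetch_now.append(url)
        else:
            needs_new_tab[origin].append(url)
    return fetch_now, dict(needs_new_tab)
```

The second return value is what a dashboard would surface to the human: one entry per origin that still needs a tab opened.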
Full archival to the standards required by the Internet Archive requires full, unmodified headers and unmodified content. This tends not to work well with modern browsers; Chrome and Firefox both currently fail at this. Someone is looking into a kind of modified Firefox to help with this, but that's just not how this system works. Now, Archive.org does have an API of sorts to say "hey, archive this URL", and a little worker on the backend goes and does it.
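To make the "unmodified headers" point concrete, here is a minimal sketch of wrapping a raw HTTP response in a WARC-style `response` record. It assumes the raw bytes were captured off the wire (for example by a recording proxy), which is exactly what a page script in a browser does not get to see. The function name `build_warc_response_record` is hypothetical, and the record is deliberately simplified: it omits payload digests and other fields that real archiving tools write.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, raw_http_response: bytes, when: datetime) -> bytes:
    """Wrap raw HTTP response bytes in a simplified WARC/1.0 response record.

    `raw_http_response` is stored verbatim: status line, headers (including
    their original casing, order and whitespace) and body are not re-parsed.
    `when` should be a timezone-aware datetime.
    """
    timestamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    warc_headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {timestamp}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(raw_http_response)}",
    ]
    head = "\r\n".join(warc_headers).encode("utf-8") + b"\r\n\r\n"
    # A WARC record is terminated by two CRLF pairs after the content block.
    return head + raw_http_response + b"\r\n\r\n"
```

The important design point is that the HTTP message is treated as opaque bytes. Any step that parses headers into a dictionary and re-serializes them would already lose the fidelity being asked for here.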

What Archive Team does is on a much more massive scale: think SETI@home-scale scraping of data across the internet. At almost every point we have had to build custom tools to make sure our archival efforts meet our needs.

> standards required by the Internet Archive require that full unmodified headers are required

Sure, this would not be a solution for the Wayback Machine, but it would be adequate[1][2] for lots of non-Wayback collections (of the sort that Archive Team is associated with).

1. https://twitter.com/textfiles/status/970912494284779520

2. http://ascii.textfiles.com/archives/4285

Similar: github.com/InternetArchive/warcprox
That would be awesome, do you think you could write that?
I'd definitely be interested in working on getting as close as possible if the grant money were to appear.