| HN Mirror

	by cxr 1574 days ago
	Why even require that? If the data in question is available over HTTP, it should be as easy as opening a page from the relevant origin in a browser tab, optionally opening a second tab for a "Warrior Dashboard", then invoking a bookmarklet on the former to slurp up data by XHR &tc. (If it's necessary to cross origins as the thing roves around, the dashboard can alert you to this while it continues doing what it can with the first origin. Just have the human return to the dashboard from time to time and repeat the second step to run as many in parallel as they want.)
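A rough sketch of what the bookmarklet's core loop might look like, under stated assumptions: `partitionByOrigin`, `crawlSameOrigin`, and the injected `fetchText` callback are hypothetical names for illustration, not part of any existing Archive Team tool. It fetches the URLs that share the tab's origin and sets the cross-origin ones aside for the dashboard to report.

```typescript
// Hypothetical sketch: all names and shapes here are illustrative assumptions.
type FetchText = (url: string) => Promise<string>;

interface CrawlResult {
  pages: Map<string, string>; // absolute URL -> response body (same-origin only)
  deferred: string[];         // cross-origin URLs left for a human to open in another tab
}

// Resolve each URL against the origin, then split into same-origin and other.
function partitionByOrigin(
  urls: string[],
  origin: string
): { same: string[]; other: string[] } {
  const same: string[] = [];
  const other: string[] = [];
  for (const u of urls) {
    const resolved = new URL(u, origin);
    if (resolved.origin === origin) {
      same.push(resolved.href);
    } else {
      other.push(resolved.href);
    }
  }
  return { same, other };
}

// Fetch everything reachable from this origin; defer the rest.
async function crawlSameOrigin(
  urls: string[],
  origin: string,
  fetchText: FetchText
): Promise<CrawlResult> {
  const { same, other } = partitionByOrigin(urls, origin);
  const pages = new Map<string, string>();
  for (const url of same) {
    try {
      pages.set(url, await fetchText(url));
    } catch {
      // Skip failures here; a real tool would record and retry them.
    }
  }
  return { pages, deferred: other };
}
```

In a browser tab this could be invoked as `crawlSameOrigin(links, location.origin, (u) => fetch(u).then((r) => r.text()))`, with `deferred` being what the dashboard would flag for the human to open next.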

3 comments

jrwr 1574 days ago

Full archival to the standards required by the Internet Archive requires full, unmodified headers and unmodified content. This tends not to work well with modern browsers; Chrome and Firefox both currently fail at this. Someone is looking into a kind of modified Firefox to help with this, but that's just not how this system works. Now, Archive.org does have an API of sorts to say "hey, archive this URL", and a little worker on the backend goes and does it.
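The "archive this URL" API is presumably the Wayback Machine's Save Page Now endpoint at `web.archive.org/save/`; that is an assumption, since the comment doesn't name it. A minimal sketch of building such a request URL, with `savePageNowUrl` as a hypothetical helper name:

```typescript
// Assumption: the Wayback Machine's "Save Page Now" endpoint, which takes
// the target URL appended after /save/.
const SAVE_PREFIX = "https://web.archive.org/save/";

// Hypothetical helper: validate the target and build the capture-request URL.
function savePageNowUrl(target: string): string {
  const parsed = new URL(target); // throws on malformed input
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error(`unsupported scheme: ${parsed.protocol}`);
  }
  return SAVE_PREFIX + parsed.href;
}
```

Requesting the resulting URL asks the Archive to do the capture on its side, which is why this route sidesteps the browser's header handling but only covers one URL at a time.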

What Archive Team does is on a much more massive scale: think SETI@home-scale scraping of data across the internet. At almost every point we have had to build custom tools to meet the needs of our archival efforts.


cxr 1574 days ago

> standards required by the Internet Archive require that full unmodified headers are required

Sure, this would not be a solution for the Wayback Machine, but would be adequate[1][2] for lots of non-Wayback collections (of the sort that Archive Team is associated with).

1. https://twitter.com/textfiles/status/970912494284779520

2. http://ascii.textfiles.com/archives/4285


TheTechRobo 1574 days ago

Similar: github.com/InternetArchive/warcprox


myself248 1574 days ago

That would be awesome, do you think you could write that?


cxr 1574 days ago

I'd definitely be interested in working on getting as close as possible if the grant money were to appear.
