| HN Mirror



	by causi 1574 days ago
	Warrior is great for the community effort, but I wish someone would put some work into a modern local site archiver. HTTrack just doesn't cut it anymore.

2 comments

myself248 1574 days ago

Oh jeez yeah. I've been going through https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-... the last few days and I've concluded that none of 'em are appropriate for someone with my level of software ineptitude.


lazyjeff 1574 days ago

For local archiving, I've been working on my own solution that's simply a background process running in your systray on Windows: https://irchiver.com/

There are some philosophical differences with ArchiveBox: 1) I'm more about automatic archiving of every web page, rather than the curation approach; 2) I prefer full-resolution screenshots over the actual source of the web page, so you save what you actually saw (so it works with dynamic pages, pages behind logins, etc.); and 3) I think full-text search is a key part of the archive, so I have implemented that.


nix23 1574 days ago

wget --recursive --page-requisites --adjust-extension --convert-links --no-parent https://YOURWEBPAGEHEREX.com

Leave out "--convert-links" if you want a "pure" copy whose links are not rewritten for local browsing.


myself248 1574 days ago

Yes yes fine, and then I get throttled to 2 bytes/sec by the server. So I did some user-agent hijinks and set my delay to like 5000msec and that helped for a while, but my machine crashed and when I went to resume the task I was throttled again.
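For anyone hitting the same wall, here is a rough sketch of a slower, re-runnable variant of the wget command above. Everything not in the thread is an assumption: the function name polite_mirror, the 200k rate cap, and the user-agent string are placeholders I made up. Note that wget's --wait takes seconds, so 5000 msec is --wait=5. I've left out --convert-links and --adjust-extension because, as far as I know, they interact awkwardly with --no-clobber on a re-run. None of this guarantees the server stops throttling you.

```shell
# Hypothetical helper: polite_mirror and all default values are assumptions,
# not something from the thread.
# --no-clobber skips files already on disk, so re-running after a crash
# picks up roughly where the crawl stopped.
# --wait and --random-wait pause between requests; --limit-rate caps bandwidth.
# Set WGET=echo to preview the command line without crawling anything.
polite_mirror() {
    url="$1"
    ${WGET:-wget} \
        --recursive \
        --page-requisites \
        --no-parent \
        --no-clobber \
        --wait="${WAIT_SECONDS:-5}" \
        --random-wait \
        --limit-rate="${RATE_LIMIT:-200k}" \
        --user-agent="${USER_AGENT:-personal-archive-sketch/0.1}" \
        "$url"
}
```

Usage would be something like `polite_mirror https://YOURWEBPAGEHEREX.com`, run again with the same arguments after an interruption.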


nix23 1573 days ago

>but my machine crashed

Maybe it's not the server that's throttling you, then ;)


traverseda 1573 days ago

Wget will exhaust all available RAM on a long enough crawl.


nix23 1572 days ago

No, I crawled many multi-gigabyte sites with my Raspberry Pi 2 for days.


TheTechRobo 1574 days ago

There's github.com/ArchiveTeam/grab-site, but unfortunately it's not maintained very well.
