by causi 1574 days ago
Warrior is great for the community effort, but I wish someone would put some work into a modern local site archiver. HTTrack just doesn't cut it anymore.

Oh jeez yeah. I've been going through https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-... the last few days and I've concluded that none of 'em are appropriate for someone with my level of software ineptitude.
For local archiving, I've been working on my own solution that's simply a background process running in your systray on Windows: https://irchiver.com/

There are some philosophical differences from ArchiveBox: 1) I'm more about automatic archiving of every web page, rather than the curation approach; 2) I prefer full-resolution screenshots over the actual source of the web page, so you save what you actually saw (which works with dynamic pages, pages behind logins, etc.); and 3) I think full-text search is a key part of the archive, so I have implemented that.

wget --recursive --page-requisites --adjust-extension --convert-links --no-parent https://YOURWEBPAGEHEREX.com

Omit "--convert-links" if you want a "pure" copy whose links are left as-is rather than rewritten for local browsing.

Yes yes fine, and then I get throttled to 2 bytes/sec by the server. So I did some user-agent hijinks and set my delay to like 5000msec and that helped for a while, but my machine crashed and when I went to resume the task I was throttled again.
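For the throttling and resume problem, a rough sketch of a gentler wget invocation might look like the following. It is not a guaranteed fix, since the server decides what it throttles. The helper name mirror_args, the 200k rate cap, and the example URL are illustrative assumptions; the 5-second wait just mirrors the 5000 ms delay mentioned above. It drops --convert-links and --adjust-extension because those options can interfere with skipping files that are already on disk when the crawl is restarted.

```shell
# Hypothetical helper: prints one wget argument per line for a slow,
# restartable mirror of the URL given as $1.
#   --wait=5 / --random-wait : pause roughly 5 s (randomly varied) between requests
#   --limit-rate=200k        : cap download bandwidth (assumed value, tune to taste)
#   --no-clobber             : on a re-run, keep files already on disk instead of
#                              fetching them again
mirror_args() {
  printf '%s\n' \
    --recursive --page-requisites --no-parent \
    --wait=5 --random-wait \
    --limit-rate=200k \
    --no-clobber \
    "$1"
}

# Example use (not executed here); run the same line again after a crash:
#   mirror_args https://YOURWEBPAGEHEREX.com | xargs wget
```

Re-running the same command after a crash should pick up roughly where it stopped, because existing files are kept and only parsed for further links. A file that was cut off mid-download is kept as-is too, so it is worth deleting the last partial file before restarting.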
>but my machine crashed

Maybe it's not the servers who throttle you then ;)

Wget will exhaust all available RAM on a long enough crawl.
No, I crawled many multi-gigabyte sites with my Raspberry Pi 2 for days.
There's github.com/ArchiveTeam/grab-site, but unfortunately it's not maintained very well.