Hacker News
by myself248 1574 days ago
Oh jeez yeah. I've been going through https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-... the last few days and I've concluded that none of 'em are appropriate for someone with my level of software ineptitude.

For local archiving, I've been working on my own solution that's simply a background process running in your systray on Windows: https://irchiver.com/

There are some philosophical differences with ArchiveBox: 1) I'm more about automatic archiving of every web page, rather than the curation approach, 2) I prefer full-resolution screenshots over the actual source of the web page so you can save what you actually saw (so it works with dynamic pages, pages behind logins, etc.), and 3) I think full-text search is a key part of the archive, so I have implemented that.

wget --recursive --page-requisites --adjust-extension --convert-links --no-parent https://YOURWEBPAGEHEREX.com

Leave out "--convert-links" if you want a "pure" copy, i.e. one whose links have not been rewritten for local browsing.

Yes yes fine, and then I get throttled to 2 bytes/sec by the server. So I did some user-agent hijinks and set my delay to something like 5000 msec, and that helped for a while, but my machine crashed, and when I went to resume the task I was throttled again.
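
For what it's worth, the delay, user-agent, and resume pieces described above can be combined in a single invocation. This is only a rough sketch under some assumptions: the 5-second wait and the user-agent string are illustrative values, and the `WGET` variable is a hypothetical hook added here so the command can be dry-run with `echo`. `--no-clobber` is used so that a rerun after a crash skips files already on disk; `--convert-links` is left out because, as far as I know, wget does not honor both options together. None of this guarantees the server will stop throttling.

```shell
#!/bin/sh
# Sketch: a slow, restartable recursive crawl of a single site.
# Assumptions (not from the thread): the 5 s wait, the user-agent string,
# and the WGET override, which exists only so the call can be dry-run.
polite_crawl() {
    # --wait / --random-wait: pause between requests, with some jitter.
    # --no-clobber: do not re-fetch files that already exist locally,
    #               so rerunning the same command picks up after a crash.
    "${WGET:-wget}" \
        --recursive --page-requisites --adjust-extension --no-parent \
        --wait=5 --random-wait \
        --no-clobber \
        --user-agent='Mozilla/5.0 (compatible; personal-archive)' \
        "$1"
}

# Dry run (prints the arguments instead of crawling):
#   WGET=echo polite_crawl https://YOURWEBPAGEHEREX.com
# Real run (rerun the same line to resume):
#   polite_crawl https://YOURWEBPAGEHEREX.com
```

Rerunning with `--no-clobber` still walks the already-saved HTML to find links, so a resume on a big site is not instant, but it should avoid downloading the same files twice.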
>but my machine crashed

Maybe it's not the servers who throttle you then ;)

Wget will exhaust all available RAM on a long enough crawl.
No, I crawled many multi-gigabyte sites with my Raspberry Pi 2 for days.
I've had memory exhaustion (on a 4GB system) after I think about 600GB in a single crawl. Splitting it into multiple crawls is of course better.

That was a site specifically set up to deal with large collections of files though.
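
Splitting one big crawl into several smaller ones, as suggested above, could look roughly like the sketch below. It assumes the site can be divided by directory; the section paths in the example are made up, and `WGET` is again a hypothetical override used only for dry runs. Each section gets its own wget process, so whatever state one process accumulates is released when it exits. Whether that is enough to avoid running out of memory on a particular site is not something this sketch can promise.

```shell
#!/bin/sh
# Sketch: one wget process per starting URL instead of one huge crawl.
# Assumptions (not from the thread): the 5 s wait and the WGET override.
crawl_in_pieces() {
    for url in "$@"; do
        # --no-parent keeps each run inside its own directory subtree.
        "${WGET:-wget}" --recursive --page-requisites --adjust-extension \
            --no-parent --wait=5 "$url" \
            || echo "crawl failed for: $url" >&2
    done
}

# Example with made-up section paths:
#   crawl_in_pieces https://YOURWEBPAGEHEREX.com/a/ https://YOURWEBPAGEHEREX.com/b/
```

A failed section is reported on stderr and the loop moves on, so one bad subtree does not stop the rest.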