Hacker News
by nix23 1574 days ago
wget --recursive --page-requisites --adjust-extension --convert-links --no-parent https://YOURWEBPAGEHEREX.com

Leave out "--convert-links" if you want a "pure" copy whose links are not rewritten for local browsing.
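
To make the two variants concrete, here is a minimal sketch. The helper name mirror_cmd and its "local"/"pure" argument are made up for illustration; the wget flags are the ones from the command above. It only prints the command, so nothing is fetched until you run the printed line yourself.

```shell
#!/bin/sh
# Sketch only: mirror_cmd and its "local"/"pure" modes are hypothetical names.
# The function prints the wget command instead of running it.
mirror_cmd() {
  mode="$1"   # "local" = browsable offline, "pure" = links left untouched
  url="$2"
  opts="--recursive --page-requisites --adjust-extension --no-parent"
  if [ "$mode" = "local" ]; then
    # Rewrite links in the saved pages so they point at the local files.
    opts="$opts --convert-links"
  fi
  echo "wget $opts $url"
}

mirror_cmd local https://YOURWEBPAGEHEREX.com
mirror_cmd pure  https://YOURWEBPAGEHEREX.com
```

The only difference between the two printed lines is the "--convert-links" flag.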


Yes, yes, fine, and then I get throttled to 2 bytes/sec by the server. So I did some user-agent hijinks and set my delay to about 5000 ms, and that helped for a while, but my machine crashed, and when I went to resume the task I was throttled again.
>but my machine crashed

Maybe it's not the servers who throttle you then ;)
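
For the throttling and resume problem described above, a rough sketch of a gentler, restartable crawl might look like the following. The helper name polite_cmd and the specific values (5 second wait, 200k rate cap, the user-agent string) are assumptions for illustration, not recommendations from the thread. It prints the command rather than running it.

```shell
#!/bin/sh
# Sketch only: polite_cmd is a hypothetical helper, and the wait, rate and
# user-agent values are illustrative assumptions.
# The function prints the wget command instead of running it.
polite_cmd() {
  url="$1"
  # --wait=5        pause 5 seconds between requests (the 5000 ms mentioned above)
  # --random-wait   vary that pause so requests are less regular
  # --limit-rate    cap download bandwidth
  # --user-agent    send a custom User-Agent string
  # --no-clobber    skip files already on disk, so a rerun after a crash
  #                 does not download them again
  echo "wget --recursive --page-requisites --adjust-extension --no-parent" \
       "--wait=5 --random-wait --limit-rate=200k" \
       "--user-agent=my-archiver/0.1 --no-clobber $url"
}

polite_cmd https://YOURWEBPAGEHEREX.com
```

"--convert-links" is deliberately left out here: wget ignores "--no-clobber" when both are given. Whether any of this avoids a given server's throttling depends on that server.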

Wget will exhaust all available RAM on a long enough crawl.
No, I crawled many multi-gigabyte sites with my Raspberry Pi 2 for days.
I've had memory exhaustion (on a 4 GB system) after, I think, about 600 GB in a single crawl. Splitting it into multiple crawls is of course better.

That was a site specifically set up to deal with large collections of files though.
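
One way to split a big crawl into several smaller ones, as suggested above, is to run wget once per top-level directory so each process starts with fresh memory. This is only a sketch: split_cmds and the section names (docs, files, images) are hypothetical, and it prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: split_cmds and the section names below are hypothetical.
# Prints one wget command per section instead of running them.
split_cmds() {
  base="$1"
  shift
  for section in "$@"; do
    # --no-parent keeps each run inside its own subdirectory.
    echo "wget --recursive --page-requisites --adjust-extension --no-parent $base/$section/"
  done
}

split_cmds https://YOURWEBPAGEHEREX.com docs files images
```

Each printed line is a separate wget process, so memory used by one section is released before the next starts. It assumes the site is laid out in directories that can be crawled independently.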