by Borealid 4084 days ago
Do you mean like httrack ( http://httrack.com )?

If you're talking about the source for dynamic pages, you can use any file copier, such as rsync. But httrack is your go-to if you just want to download a mirror of a web site.

I think he means smarter: given a bunch of CMS pages which are text content (different per page) surrounded by (semi-fixed) boilerplate, extract all the content nicely for re-importation.

It's a bit of a one-off, though.
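
One rough way to separate per-page content from semi-fixed boilerplate is to drop any line that shows up on most pages. The sketch below assumes each page has already been converted to plain text with one block per line; the function name `strip_shared_lines` and the 0.8 threshold are made up for illustration, not taken from any tool mentioned in this thread.

```python
from collections import Counter


def strip_shared_lines(pages, threshold=0.8):
    """Drop lines that occur on at least `threshold` of the pages.

    `pages` maps a page name to its plain-text content (an assumption:
    the HTML has already been reduced to text, one block per line).
    """
    # Count each distinct non-empty line once per page.
    counts = Counter()
    for text in pages.values():
        counts.update({line.strip() for line in text.splitlines() if line.strip()})

    cutoff = threshold * len(pages)
    result = {}
    for name, text in pages.items():
        kept = [
            line.strip()
            for line in text.splitlines()
            if line.strip() and counts[line.strip()] < cutoff
        ]
        result[name] = "\n".join(kept)
    return result
```

This only works when the boilerplate is identical line for line across pages; a nav bar that highlights the current page, for example, would slip through.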

Try a combination of curl/wget/httrack with pup (https://github.com/ericchiang/pup/)

Thanks everyone. Httrack is awesome, but yes, I mean something smarter. Pup looks cool. I want the result to be something that turns 200 pages of staff bios into something I can pay someone $15/hr to copy-paste quickly into the new CMS. Boilerpipe does it nicely, but it doesn't do the whole job without wget and some scripting, plus it either costs money or is complicated to set up (it's in Apache Tika, I guess).

But back on-topic, all I really mean to say is that something like this happens to me like every other week. Productize your scripts.
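
For the staff-bio case, a minimal sketch using only the Python standard library might look like the following. It assumes the pages are already downloaded (with wget or httrack, say), that the markup is reasonably well-formed, and that the old CMS wraps each bio in an element with a known id. The id `bio` and the names `ContainerText` and `bios_to_csv` are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ContainerText(HTMLParser):
    """Collect the text inside the first element whose id matches."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0      # 0 means we are outside the container
        self.done = False   # set once the container has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.container_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            # Collapse runs of whitespace inside each text chunk.
            self.chunks.append(" ".join(data.split()))


def bios_to_csv(pages, container_id="bio"):
    """Return CSV text with one row per page: page name, extracted text."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["page", "text"])
    for name, html in pages.items():
        parser = ContainerText(container_id)
        parser.feed(html)
        parser.close()
        writer.writerow([name, " ".join(parser.chunks)])
    return out.getvalue()
```

The resulting CSV opens in any spreadsheet, which gives whoever does the copy-pasting one row per bio. Pages with unclosed tags such as a bare `<p>` would throw off the depth counting, so a real run would need a more forgiving parser.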