Hacker News
by matt_morgan 4085 days ago
Do you know what I found out over the last few days? There's no simple tool for downloading the actual content of your website, you know, for migrating it to a new CMS or whatever. Unbelievable. I mean something that just runs text extraction over whatever `wget -r` fetches and saves it all. Boilerpipe does the extraction nicely, but nobody has turned it into a simple tool.

You just have to have a job and try to get stuff done for a while and this kind of thing comes up. Just wait and watch.

1 comments

Do you mean something like httrack (http://httrack.com)?

If you're talking about the source for dynamic pages, you can use any file copier, like rsync. But httrack is your go-to if you just want to download a mirror of a web site.

I think he means smarter: given a bunch of CMS pages which are text content (different per page) surrounded by (semi-fixed) boilerplate, extract all the content nicely for re-importation.
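A minimal sketch of that idea, assuming each page has already been reduced to plain text: treat any line that shows up on most pages as boilerplate and drop it. The function name `strip_shared_lines`, the 0.8 threshold, and the sample pages are all made up for illustration, and this is far cruder than what Boilerpipe does.

```python
from collections import Counter


def strip_shared_lines(pages, threshold=0.8):
    """Drop lines that occur on at least `threshold` of the pages.

    `pages` maps a page name to its plain text. The name and the 0.8
    default are illustrative assumptions, not from any library.
    """
    # Count each distinct non-empty line once per page.
    seen = Counter()
    for text in pages.values():
        seen.update({line.strip() for line in text.splitlines() if line.strip()})

    cutoff = threshold * len(pages)
    cleaned = {}
    for name, text in pages.items():
        kept = [
            line.strip()
            for line in text.splitlines()
            if line.strip() and seen[line.strip()] < cutoff
        ]
        cleaned[name] = "\n".join(kept)
    return cleaned


# Hypothetical sample data: three staff pages sharing a header and footer.
pages = {
    "alice.html": "Acme Corp\nHome | Staff | Contact\nAlice runs the lab.\n(c) Acme",
    "bob.html": "Acme Corp\nHome | Staff | Contact\nBob fixes the servers.\n(c) Acme",
    "carol.html": "Acme Corp\nHome | Staff | Contact\nCarol writes the docs.\n(c) Acme",
}
cleaned = strip_shared_lines(pages)
```

With only a handful of pages the threshold matters a lot; on a real site you would probably want to eyeball what gets dropped before trusting it.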

It's a bit of a one-off, though.

Try a combination of curl/wget/httrack with Pup (https://github.com/ericchiang/pup/).

Thanks everyone. Httrack is awesome, but yes, I mean smarter. Pup looks cool. I want something that turns 200 pages of staff bios into text I can pay someone $15/hr to copy-paste quickly into the new CMS. Boilerpipe does the extraction nicely, but it doesn't do the whole job without wget and some scripting, plus it either costs money or is complicated (it's in Apache Tika, I guess).
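For the "some scripting" part, here is a rough sketch using only the Python standard library. It assumes the site has already been mirrored to a local directory (for example with `wget -r`), walks every `.html` file, and writes the visible text into one file for copy-pasting. It does plain tag stripping, not Boilerpipe-style boilerplate removal, and the names `TextExtractor`, `html_to_text`, and `combine_mirror` are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def combine_mirror(mirror_dir, out_file):
    """Write the text of every .html file under mirror_dir into out_file.

    Returns the number of pages written. Assumes the mirror is already
    on disk; this function does no downloading.
    """
    root = Path(mirror_dir)
    sections = []
    for path in sorted(root.rglob("*.html")):
        text = html_to_text(path.read_text(encoding="utf-8", errors="replace"))
        sections.append(f"=== {path.relative_to(root).as_posix()} ===\n{text}\n")
    Path(out_file).write_text("\n".join(sections), encoding="utf-8")
    return len(sections)
```

Usage would be something like `combine_mirror("example.com", "bios.txt")` after the mirror finishes, where both paths are placeholders. Each page gets a `=== path ===` header so whoever does the copy-pasting can tell where one bio ends and the next begins.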

But back on topic, all I really mean to say is that something like this comes up for me about every other week. Productize your scripts.