HN Mirror



	by matt_morgan 4132 days ago
	Do you know what I found out over the last few days? There's no simple tool that you can use to download the actual content of your website, you know, for migrating it to a new CMS or whatever. Unbelievable. Something that will just run a text extraction through `wget -r` and save it all. Boilerpipe does the extraction nicely, but nobody has turned it into a simple tool. You just have to have a job and try to get stuff done for a while and this kind of thing comes up. Just wait and watch.
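A minimal sketch of the kind of tool described above, assuming the site has already been mirrored to a local directory (for example with `wget -r`). It only strips tags and skips `<script>`/`<style>` contents using the Python standard library; it does not do Boilerpipe-style boilerplate detection. The names `TextExtractor`, `extract_text`, and `extract_mirror` are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def extract_text(html):
    """Return the text chunks of one HTML document, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def extract_mirror(root):
    """Map each .html/.htm file under root (a mirrored site) to its text."""
    root = Path(root)
    pages = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            html = path.read_text(encoding="utf-8", errors="replace")
            pages[str(path.relative_to(root))] = extract_text(html)
    return pages
```

Calling `extract_mirror("example.com")` on a mirror directory (the directory name here is a placeholder) would return a dict of relative path to extracted text, navigation and footer text included.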

1 comments

Borealid 4132 days ago

Do you mean like httrack ( http://httrack.com )?

If you're talking about the source for dynamic pages, you can use any file copier like rsync. But httrack is your go-to if you're just talking about downloading a web site mirror image.


pjc50 4131 days ago

I think he means smarter: given a bunch of CMS pages which are text content (different per page) surrounded by (semi-fixed) boilerplate, extract all the content nicely for re-importation.

It's a bit of a one-off, though.
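One rough way to approximate the idea above (per-page content surrounded by semi-fixed boilerplate) is to drop any line of text that shows up on most pages. This is a simple frequency heuristic, not what Boilerpipe does; the function name `strip_shared_lines` and the `threshold` default are assumptions made for illustration.

```python
from collections import Counter


def strip_shared_lines(pages, threshold=0.8):
    """Remove lines that appear on at least `threshold` of all pages.

    pages: dict mapping a page name to its extracted plain text.
    Returns a new dict with the shared (boilerplate-like) lines removed.
    """
    if not pages:
        return {}

    counts = Counter()
    for text in pages.values():
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in text.splitlines() if line.strip()})

    cutoff = threshold * len(pages)
    cleaned = {}
    for name, text in pages.items():
        kept = [
            line.strip()
            for line in text.splitlines()
            if line.strip() and counts[line.strip()] < cutoff
        ]
        cleaned[name] = "\n".join(kept)
    return cleaned
```

The heuristic needs a reasonable number of pages to work: with a single page, every line counts as shared and is removed. It will also drop genuine content that happens to repeat on most pages, such as a common job title.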


Immortalin 4131 days ago

Try a combination of curl/wget/httrack with pup (https://github.com/ericchiang/pup/)


matt_morgan 4131 days ago

Thanks everyone. Httrack is awesome, but yes, I mean smarter. Pup looks cool. I want the result to be something that turns 200 pages of staff bios into something I can pay someone $15/hr to copy-paste quickly into the new CMS. Boilerpipe does it nicely, but doesn't do the whole job without wget and some scripting, plus it costs money or is complicated (it's in Apache Tika, I guess).

But back on-topic, all I really mean to say is that something like this happens to me like every other week. Productize your scripts.
