| HN Mirror

by xroche 1812 days ago

Glad the project helped a few people a bit :) Unfortunately I don't have much time to enhance the engine nowadays, and the code is dirty and broken beyond repair. Yet I'm still puzzled to see how many people are still using the project today.

You'll probably find better approaches, and while I never tried scrapy, it seems to be using a javascript engine for hard cases, which was something I thought about (but this was way above my skills at that time).

The hard part remains, however, if you want a functional site: you need to rewrite links, or use an external proxy-like mechanism. Having a fully functional, offline, file-based site is the really tricky part. Some cases will remain unsolvable, as a page's internal code logic can produce any external resource link based on randomness, time, etc.
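As a rough illustration of the link-rewriting step mentioned above, here is a minimal Python sketch. It is not httrack's actual code: the `local_path` layout (host name as top-level directory, `index.html` for directory URLs) and the `mirrored` set of already-downloaded URLs are assumptions made for the example, and query strings are simply ignored.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def local_path(url):
    """Map an absolute URL to a (hypothetical) path inside the mirror directory."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        # Directory-style URLs need a file name on disk.
        path += "index.html"
    return posixpath.join(parts.netloc, path.lstrip("/"))


def rewrite_link(page_url, href, mirrored):
    """Return a relative local link if the target was mirrored, else the absolute URL."""
    target = urljoin(page_url, href)
    target_no_frag, _, frag = target.partition("#")
    if target_no_frag not in mirrored:
        # Not downloaded: keep pointing at the live site.
        return target
    start = posixpath.dirname(local_path(page_url))
    rel = posixpath.relpath(local_path(target_no_frag), start)
    return rel + ("#" + frag if frag else "")


page = "https://example.com/docs/guide/intro.html"
mirrored = {"https://example.com/docs/img/logo.png", "https://example.com/"}
print(rewrite_link(page, "/docs/img/logo.png", mirrored))
print(rewrite_link(page, "/", mirrored))
print(rewrite_link(page, "https://other.org/x", mirrored))
```

Links computed at runtime from randomness or time never pass through a function like this, which is why such cases stay unsolvable for a file-based mirror.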

The approach in httrack was both ugly and pragmatic: attempting to recognize link/file patterns within javascript, then fetch and replace what can be replaced with local links. Javascript producing html will typically be analyzed with really dumb - yet sometimes effective - js parsers. (parental advisory: don't look at the parsers' code, your eyes would melt)

And obviously this is not going to solve all cases, and will even break pages with tricky js.
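To give a feel for the "recognize link/file patterns within javascript" idea, here is a deliberately dumb Python sketch. It is an assumption-laden illustration, not httrack's parser: the regex, the extension list, and the `local_map` dictionary (remote link to local path) are all made up for the example.

```python
import re

# Quoted string literals that end in a known file extension.
# The extension list is an assumption for this sketch and is far from complete.
LINK_RE = re.compile(
    r"""(["'])([^"'\s<>]+\.(?:html?|css|js|png|jpe?g|gif))\1""",
    re.IGNORECASE,
)


def find_js_links(js_source):
    """Return string literals in the script that look like file links."""
    return [m.group(2) for m in LINK_RE.finditer(js_source)]


def rewrite_js_links(js_source, local_map):
    """Replace recognized links that have a local copy; leave everything else alone."""
    def replace(match):
        quote, link = match.group(1), match.group(2)
        return quote + local_map.get(link, link) + quote
    return LINK_RE.sub(replace, js_source)


script = 'img.src = "http://example.com/a/logo.png"; var x = "hello";'
local_map = {"http://example.com/a/logo.png": "a/logo.png"}
print(find_js_links(script))
print(rewrite_js_links(script, local_map))

# A link assembled at runtime is invisible to this kind of pattern matching.
print(find_js_links('var u = "img/" + name + ".png";'))
```

The last call shows the limitation: a URL built by string concatenation never appears as a single literal, so nothing is found. Blind replacement can also alter strings that only look like links, which is one way pages with tricky js get broken.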

3 comments

boulos 1812 days ago

Let me be clear: Thanks, Xavier!

httrack was extremely helpful and there really was no equal. The “modern” web requires a live JS engine, but as you point out, even the “old” web had server-side logic that couldn’t be captured.

In that light, I think httrack has stood up pretty well and nobody expects you to go rewrite it or clean it up. If someone today has a mostly static site they want to archive without writing custom code, I would still recommend httrack (it’s more controllable than wget or similar). I just assume that those sites are mostly gone :(.

robtherobber 1812 days ago

I'm using this software on and off, but it's especially useful when clients plan to redo their websites and I want to make sure I have a copy of the pages offline but don't have access to server backups or things like that.

But, generally speaking, being able to preserve "the internet" by saving whole websites offline should be something we give more attention to.

Just read this recently: https://www.theatlantic.com/technology/archive/2021/06/the-i...

enqk 1812 days ago

I still use it to mirror websites :) After all I also witnessed its creation! Dirty code can still be super useful
