by xroche
1812 days ago
Glad the project helped a few people a bit :) Unfortunately I don't have much time to enhance the engine nowadays, and the code is dirty and broken beyond any repair. Yet I'm still puzzled to see how many people are still using the project today.

You'll probably find better approaches, and while I never tried scrapy, it seems to be using a javascript engine for hard cases, which was something I thought about (but this was way above my skills at that time). The hard part remains, however, if you want a functional site: you need to rewrite links, or use an external proxy-like mechanism. Having a fully functional offline, file-based site is the real tricky part. Some cases will remain unsolvable, as the page's internal code logic can produce any external resource link based on randomness, time, etc.

The approach in httrack was both ugly and pragmatic: attempting to recognize link/file patterns within javascript, then fetch and replace what can be replaced with local links. Javascript producing html will typically be analyzed with really dumb - yet sometimes effective - js parsers. (parental advisory: don't look at the parsers' code, your eyes would melt) And obviously this is not going to solve all cases, and will even break pages with tricky js.
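To make the "recognize link patterns and replace them with local links" idea concrete, here is a minimal Python sketch. It is not httrack's actual code (httrack is written in C and its heuristics are far more involved). The regex, the list of file extensions, and the names `LINK_LITERAL`, `local_path_for` and `rewrite_links` are all assumptions made up for illustration.

```python
import re
from typing import Set
from urllib.parse import urljoin, urlparse

# Quoted string literals that end in a common web file extension.
# Illustrative assumption only: a real mirroring tool needs many more patterns.
LINK_LITERAL = re.compile(
    r"""(?P<quote>["'])(?P<url>[^"'\s<>]+\.(?:html?|css|js|png|jpe?g|gif))(?P=quote)""",
    re.IGNORECASE,
)


def local_path_for(url: str) -> str:
    """Map an absolute URL to a path under the mirror root (hypothetical layout)."""
    parts = urlparse(url)
    return parts.netloc + parts.path


def rewrite_links(source: str, base_url: str, mirrored: Set[str]) -> str:
    """Replace link-like literals with local paths when the target was mirrored."""

    def replace(match: "re.Match[str]") -> str:
        absolute = urljoin(base_url, match.group("url"))
        if absolute not in mirrored:
            # Not downloaded: leave the original literal untouched.
            return match.group(0)
        quote = match.group("quote")
        return f"{quote}{local_path_for(absolute)}{quote}"

    return LINK_LITERAL.sub(replace, source)


if __name__ == "__main__":
    js = 'var img = "/images/logo.png"; location.href = "http://other.example/page.html";'
    mirrored = {"http://example.com/images/logo.png"}
    print(rewrite_links(js, "http://example.com/index.html", mirrored))
```

A pattern match like this never runs the script, so a URL assembled at runtime (string concatenation, random or time-based values) is simply missed, and a literal that only looks like a link can be rewritten by mistake. That is the "will even break pages with tricky js" failure mode. The sketch also writes paths relative to the mirror root; a usable mirror would have to make them relative to the file being rewritten.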
|
httrack was extremely helpful and there really was no equal. The “modern” web requires a live JS engine, but as you point out, even the “old” web had server-side logic that couldn’t be captured.
In that light, I think httrack has stood up pretty well and nobody expects you to go rewrite it or clean it up. If someone today has a mostly static site they want to archive without writing custom code, I would still recommend httrack (it’s more controllable than wget or similar). I just assume that those sites are mostly gone :(.