by alganet 625 days ago
Nope. It is for the classic web (the only websites worth saving anyway).

Even for the classic web, if a site is behind Cloudflare, then HTTrack no longer works.

It's a sad point to be at. Fortunately, the SingleFile extension still works really well for single pages, even when they are built dynamically by JavaScript on the client side. There isn't a solution for cloning an entire site though, at least not that I know of.

If it is cloudflare human verification, then httrack will have an issue. But in the end it's just a cookie, you can use a browser with JS to grab the cookie, then feed it to httrack headers.

If cloudflare ddos protection is an issue, you can throttle httrack requests.
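
To make the two suggestions above concrete, here is a rough Python sketch of the same idea. It is not httrack itself, just a stand-in that sends the cookie and user agent copied from a browser session and pauses between requests. `cf_clearance` is the name Cloudflare uses for its clearance cookie; the function names, the delay value and the injectable `opener`/`sleep` parameters are my own assumptions for illustration.

```python
import time
import urllib.request


def build_headers(cookie_value, user_agent):
    """Headers copied from the browser session that passed the check.

    cookie_value and user_agent are placeholders you would paste in
    from your own browser's dev tools.
    """
    return {
        "Cookie": f"cf_clearance={cookie_value}",
        "User-Agent": user_agent,
    }


def throttled_fetch(urls, headers, delay_seconds=2.0,
                    opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch URLs one at a time, sleeping between requests.

    opener and sleep are injectable (hypothetical parameters) so the
    loop can be exercised without touching the network.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # crude throttle: one request per delay
        request = urllib.request.Request(url, headers=headers)
        with opener(request) as response:
            pages[url] = response.read()
    return pages
```

Whether this is enough depends on the site: as the replies below point out, some configurations also tie the cookie to the IP and TLS fingerprint that earned it.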

> you can use a browser with JS to grab the cookie, then feed it to httrack headers

They also check your user agent, IP and JA3 fingerprint (and ensure they match the ones that got the cookie), so it's not as simple as copying some cookies. This might just be for paying customers though, since it doesn't do such heavy checks for some sites.

Dude. Cookie is a header, user agent is a header, ja3 is a header. It's the same stuff.

These protections are against DDoS attacks, botnets, and large crawling infrastructures, which pay a real cost for having to sync header info.

If you're just a single tired dev saving a website because you care about some content, none of this is a significant barrier.

Dude. JA3 is your TLS fingerprint. Most libraries don't let you spoof it. The annoying thing is that with new versions of Chrome and Firefox, JA3 is randomized per session so it changes every time. You need to intercept the request in Wireshark to get it.
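
For anyone wondering what the fingerprint actually is: as I understand the published JA3 description, it is an MD5 hash over five fields of the TLS ClientHello (version, cipher suites, extensions, elliptic curves, point formats), joined with commas and dashes. A small sketch, with made-up illustrative values rather than a real capture, shows why it isn't something you set like a header and why shuffling the extension order changes the hash:

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string: five comma-separated fields,
    each a dash-separated list of decimal values in ClientHello order."""
    fields = [
        str(tls_version),
        "-".join(str(value) for value in ciphers),
        "-".join(str(value) for value in extensions),
        "-".join(str(value) for value in curves),
        "-".join(str(value) for value in point_formats),
    ]
    return ",".join(fields)


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string."""
    raw = ja3_string(tls_version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Illustrative values only (assumed, not taken from a real browser).
hello = dict(
    tls_version=771,
    ciphers=[4865, 4866, 4867],
    extensions=[0, 23, 65281, 10, 11],
    curves=[29, 23, 24],
    point_formats=[0],
)

# Same extensions, different order: the hash no longer matches.
shuffled = dict(hello, extensions=[11, 10, 65281, 23, 0])

print(ja3_hash(**hello))
print(ja3_hash(**shuffled))
```

The inputs come from the TLS handshake your client library performs, which is why an HTTP client that lets you set arbitrary headers still can't change them.
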

Seconded. It seems to depend on the site's settings, and those in turn are heavily governed by the subscription plan the site is on.

I'm aware of this tool, but I'm sure there are caveats in terms of "totally" cloning a website:

https://github.com/ArchiveTeam/grab-site