| HN Mirror

	by alganet 625 days ago
	Nope. It is for the classic web (the only websites worth saving anyway).

freedomben 625 days ago

Even for the classic web, if a site is behind Cloudflare, then HTTrack no longer works.

It's a sad point to be at. Fortunately, the SingleFile extension still works really well for single pages, even when they are built dynamically by JavaScript on the client side. There isn't a solution for cloning an entire site though, at least not that I know of.

alganet 625 days ago

If it is Cloudflare's human verification, then HTTrack will have an issue. But in the end it's just a cookie: you can use a browser with JS to grab the cookie, then feed it to httrack headers.

If Cloudflare's DDoS protection is an issue, you can throttle HTTrack's requests.
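
A rough sketch of what that could look like, assuming the flags documented in the HTTrack manual (`-F` for the user agent, `-%X` for an extra header line, `-cN` for sockets, `-%cN` for connections per second, `-AN` for max bytes per second). The helper name, the cookie value, the user agent string and the throttle numbers are all placeholders, and this only builds the command line, it doesn't run anything. Whether the site accepts the replayed cookie is a separate question.

```python
import shlex


def build_httrack_command(url, out_dir, cookie, user_agent,
                          sockets=1, conns_per_sec=1, max_rate_bytes=25000):
    """Build (but do not run) an httrack argv list.

    Hypothetical helper: the cookie and user agent are whatever you copied
    from the browser session that passed the verification page.
    """
    return [
        "httrack", url,
        "-O", out_dir,                 # where the mirror is written
        "-F", user_agent,              # same user agent as the browser
        "-%X", f"Cookie: {cookie}",    # extra request header line
        f"-c{sockets}",                # simultaneous connections
        f"-%c{conns_per_sec}",         # new connections per second
        f"-A{max_rate_bytes}",         # max transfer rate in bytes/second
    ]


if __name__ == "__main__":
    cmd = build_httrack_command(
        "https://example.com/",
        "./mirror",
        "cf_clearance=PLACEHOLDER",        # placeholder, not a real token
        "Mozilla/5.0 (placeholder UA)",    # placeholder, copy yours instead
    )
    # Print a shell-safe version to inspect before running it yourself.
    print(shlex.join(cmd))
```

The defaults are deliberately slow (one socket, one connection per second); raise them only if the site tolerates it.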

_lvbh 625 days ago

> you can use a browser with JS to grab the cookie, then feed it to httrack headers

They also check your user agent, IP and JA3 fingerprint (and ensure they match the ones that got the cookie), so it's not as simple as copying some cookies. This might just be for paying customers though, since it doesn't do such heavy checks for some sites.

alganet 624 days ago

Dude. Cookie is a header, user agent is a header, ja3 is a header. It's the same stuff.

These protections are against DDoS attacks, botnets, and large crawling infrastructures, which lose out by having to sync header info across machines.

If you're just a single tired dev saving a website because you care about some content, none of this is a significant barrier.

_lvbh 624 days ago

Dude. JA3 is your TLS fingerprint. Most libraries don't let you spoof it. The annoying thing is that with new versions of Chrome and Firefox, JA3 is randomized per session so it changes every time. You need to intercept the request in Wireshark to get it.
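
To make the point concrete, here is a minimal sketch of how a JA3 hash is derived, following the published JA3 description: five ClientHello fields (TLS version, cipher suites, extensions, elliptic curves, point formats) are joined with commas, values inside a field are joined with dashes, and the result is MD5-hashed. The numbers below are illustrative, not captured from a real browser, and the function names are made up for this example. Nothing here comes from HTTP headers, and reordering the extensions alone (which is my understanding of what recent browsers do) yields a different hash.

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Join ClientHello fields: ',' between fields, '-' within a field."""
    fields = [ciphers, extensions, curves, point_formats]
    parts = [str(tls_version)]
    parts += ["-".join(str(v) for v in field) for field in fields]
    return ",".join(parts)


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """MD5 hex digest of the JA3 string."""
    raw = ja3_string(tls_version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Illustrative values only (assumption: not a real browser handshake).
    version = 771
    ciphers = [4865, 4866, 4867]
    extensions = [0, 23, 65281, 10, 11]
    curves = [29, 23, 24]
    point_formats = [0]

    print(ja3_string(version, ciphers, extensions, curves, point_formats))
    print(ja3_hash(version, ciphers, extensions, curves, point_formats))

    # Same extensions in a different order: the fingerprint changes.
    shuffled = list(reversed(extensions))
    print(ja3_hash(version, ciphers, shuffled, curves, point_formats))
```

So matching it means controlling how your client builds its TLS handshake, which is why setting a header isn't enough.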

freedomben 625 days ago

Seconded. It seems to depend on the site's settings, and those in turn are heavily restricted by the subscription plan the site is on.

knowaveragejoe 625 days ago

I'm aware of this tool, but I'm sure there are caveats in terms of "totally" cloning a website:

https://github.com/ArchiveTeam/grab-site