| HN Mirror



	by op7 454 days ago
	This isn't 1998 anymore, so downloading the files from modern websites doesn't really work if you're trying to maintain your own private local or re-hosted copy of a site, especially one with dynamically loaded content. Some additional processing is needed to fix the files. I have never been able to find a modern scraping solution that works with most modern websites. I suppose the existence of this sort of tool conflicts with the interests of Big Tech, since it would make creating visually identical phishing sites that much easier.
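
	To give a rough idea of the kind of "additional processing" meant here, below is a minimal sketch (not a complete solution) that rewrites absolute links pointing at the original host into root-relative paths, so a saved page loads its assets from the local copy. The function name `localize_urls` and the `example.com` origin are made up for illustration, and the regex only handles simple quoted `href`/`src` attributes. It does nothing for content loaded dynamically by scripts.

```python
import re
from urllib.parse import urlsplit

# Matches simple quoted href="..." / src="..." attributes (illustrative only;
# a real tool would use a proper HTML parser and cover more attributes).
ATTR_RE = re.compile(
    r"""(?P<attr>\b(?:href|src))=(?P<q>["'])(?P<url>.*?)(?P=q)""",
    re.IGNORECASE,
)


def localize_urls(html: str, origin: str) -> str:
    """Rewrite URLs on `origin`'s host to root-relative paths.

    Hypothetical helper: relative URLs and third-party URLs are left alone.
    """
    origin_host = urlsplit(origin).netloc.lower()

    def fix(match: re.Match) -> str:
        parts = urlsplit(match.group("url"))
        if parts.netloc.lower() != origin_host:
            return match.group(0)  # relative link or a different host
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        if parts.fragment:
            path += "#" + parts.fragment
        quote = match.group("q")
        return f"{match.group('attr')}={quote}{path}{quote}"

    return ATTR_RE.sub(fix, html)


page = (
    '<link href="https://example.com/css/site.css">'
    '<img src="https://cdn.other.net/a.png">'
    '<a href="/about">x</a>'
)
print(localize_urls(page, "https://example.com"))
```

	Only the first link changes in this sample; the third-party image and the already-relative link are passed through untouched.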

1 comments

stuffoverflow 454 days ago

There definitely are tools for scraping basically any site by using the browser itself to make sure all dynamically loaded content gets intercepted correctly. Browsertrix[0] is probably the best-known and most complete scraper for that. They offer it as a paid service for convenient setup, but it's open source and can be self-hosted as well.

0: https://webrecorder.net/browsertrix/


weinzierl 454 days ago

Interesting, I had never heard of them before. Pricing looks reasonable, except that the time limit is per month. A daily limit sounds much more practical. How do people use that in a useful way?

Does anyone have experience self-hosting this in the cloud? I'd worry about runaway traffic costs, but since ingress is cheap most of the time, maybe this is not a big problem?
