Hacker News
by op7 454 days ago
This isn't 1998 anymore, so simply downloading the files from a modern website doesn't really work if you're trying to maintain your own private local or re-hosted copy of a site, especially one with dynamically loaded content. Some additional processing is needed to fix the files. I have never been able to find a modern scraping solution that works with most modern websites. I suppose the existence of this sort of tool conflicts with the interests of Big Tech, since it would make creating visually identical phishing sites that much easier.
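To make "additional processing" concrete, here is a rough sketch of one such fix-up step: rewriting `href`/`src` attributes so a saved page points at local copies of files that were also saved. Everything in it is an assumption for illustration (the function names `local_path` and `rewrite_links`, the host/path directory layout, the regex-based attribute matching), not how any particular tool does it. It also does nothing for content loaded by scripts at runtime, which is the hard part.

```python
import posixpath
import re
from urllib.parse import urljoin, urlparse

# Hypothetical layout: every saved URL lives at <host>/<path> inside the mirror.
def local_path(url: str) -> str:
    """Map an absolute URL to a relative path inside the mirror directory."""
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # directory-style URLs get an index file
    return posixpath.join(parts.netloc, path.lstrip("/"))

# Naive matcher for quoted href/src attributes; a real tool would use an HTML parser.
ATTR_RE = re.compile(
    r'''(?P<attr>\b(?:href|src))=(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)

def rewrite_links(html: str, page_url: str, saved: set[str]) -> str:
    """Point href/src at local copies when the target URL was also saved."""
    page_dir = posixpath.dirname(local_path(page_url))

    def repl(match: re.Match) -> str:
        absolute = urljoin(page_url, match.group("url"))
        if absolute not in saved:
            return match.group(0)  # leave links to unsaved resources untouched
        rel = posixpath.relpath(local_path(absolute), start=page_dir)
        return f'{match.group("attr")}={match.group("q")}{rel}{match.group("q")}'

    return ATTR_RE.sub(repl, html)

page = "https://example.com/blog/post.html"
saved_urls = {"https://example.com/static/app.js"}
html_in = '<script src="/static/app.js"></script><a href="https://other.org/">x</a>'
print(rewrite_links(html_in, page, saved_urls))
```

The script tag ends up pointing at `../static/app.js`, while the link to the site that was not saved is left as it was.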

There definitely are tools for scraping basically any site by using the browser itself to make sure all dynamically loaded content gets intercepted correctly. Browsertrix[0] is probably the best known and most complete scraper for that. They offer it as a paid service for convenient setup, but it's open source and can be self-hosted as well.

0: https://webrecorder.net/browsertrix/

Interesting, I had never heard of them before. Pricing looks reasonable except for the time limit being per month. A daily limit sounds much more practical. How do people use that in a useful way?

Does anyone have experience self-hosting this in the cloud? I'd worry about runaway traffic costs, but since ingress is cheap most of the time, maybe this is not a big problem?