| HN Mirror



	by cameroncairns 1956 days ago
	I think this article does an OK job covering how to scrape websites rendered server-side, but I strongly discourage people from scraping SPAs using a headless browser unless they absolutely have to. The article's author touches on this briefly, but you're far better off using the network tab in your browser's debug tools to see what AJAX requests are being made and figuring out how those APIs work. This approach puts far less load on the target website since you don't need to request a bunch of other resources, reduces your overall bandwidth costs, and greatly speeds up your script since you don't spend time running JavaScript in the headless browser. That can be especially slow if your script has to click or otherwise interact with elements on the page to get the results you need.

	Other than that, I'd strongly caution anyone looking into making parallel requests. Always keep in mind the sysadmins and engineers behind the site you are targeting. It can be tempting to value your own time by making a ton of parallel requests to reduce the overall runtime of your script, but you can potentially cause massive server load for the site you're targeting. If that isn't enough motivation to give you pause, keep in mind that the site owner is more likely to make the site hostile to scrapers if there are too many bad actors hitting the site heavily.
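	A rough sketch of both points above, calling the JSON endpoint directly and making one request at a time with a pause in between, might look like the following. Everything specific here is an assumption for illustration: the `?page=` URL pattern, the `items` key in the response, the one-second delay, and the User-Agent string are hypothetical, and the real values would come from whatever you see in the network tab for the site in question.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """GET a JSON endpoint directly instead of rendering the page in a browser."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/json",
            # Hypothetical identifier; lets the site owner see who is calling.
            "User-Agent": "example-scraper/0.1",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def crawl_pages(
    url_template: str,
    max_pages: int,
    delay_seconds: float = 1.0,
    fetch: Callable[[str], dict] = fetch_json,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield items page by page, sequentially, pausing after each non-empty page."""
    for page in range(1, max_pages + 1):
        payload = fetch(url_template.format(page=page))
        # Assumption: the endpoint returns its results under an "items" key.
        items = payload.get("items", [])
        if not items:
            break  # an empty page is treated as the end of the listing
        yield from items
        sleep(delay_seconds)  # be polite: no parallel bursts against the server
```

	Usage would be something like `for item in crawl_pages("https://example.com/api/items?page={page}", max_pages=50): ...`. The `fetch` and `sleep` parameters exist only so the loop can be exercised without touching the network.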


jamra 1956 days ago

How would you deal with authentication?

ddorian43 1956 days ago

There was (is?) a DARPA project called "Memex" that was built to crawl the hidden web. It has many tools for things like crawling with authentication, automatic registration, machine learning to detect search forms, auto-detecting pagination, and so on: https://github.com/darpa-i2o/memex-program-index

cameroncairns 1956 days ago

I don't! As far as I know, scraping data behind a login is illegal in the United States. You can look into the Ninth Circuit case Facebook v. Power Ventures for background on that. This page https://www.rcfp.org/scraping-not-violation-cfaa/ seems to have a decent overview of scraping laws in general. It's definitely a legal gray area, so I'd suggest doing your research! This doesn't constitute legal advice and all that. I'm not a lawyer, just a guy who does some scraping here and there :)