|
|
|
|
|
by cameroncairns
1956 days ago
|
|
I think this article does an OK job covering how to scrape websites rendered serverside, but I strongly discourage people from scraping SPAs using a headless browser unless they absolutely have to. The article's author touches on this briefly, but you're far better off using the network tab in your browser's debug tools to see what AJAX requests are being made and figuring out how those APIs work. This approach results in far less server load for the target website as you don't need to request a bunch of other resources, reduces the overall bandwidth costs, and greatly speeds up the runtime of your script since you don't need to spend time running javascript in the headless browser. That can be especially slow if your script has to click/interact with elements on the page to get the results you need. Other than that, I'd strongly caution anyone looking into making parallel requests. Always keep in mind the sysadmin and engineers behind the site you are targeting. It's can be tempting to value your own time by making a ton of parallel requests to reduce the overall time of your script, but you can potentially cause massive server load for the site you're targeting. If that isn't enough motivation to cause you pause, keep in mind that the site owner is more likely to make the site hostile to scrapers if there are too many bad actors hitting the site heavily. |
|