by TBurette 1496 days ago
Is there a good way to combine the Scrapy framework (retries, rate limiting, ...) with a headless browser such as Selenium (to get the full JS-loaded, client-side data)?

When I had to do it, I ended up making each page request twice: once with Scrapy and once again with Selenium.

You can use something like scrapy-playwright[0] to run a headless browser framework as your download handler. I think there are versions for some of the other headless systems, if you prefer those.

[0] https://github.com/scrapy-plugins/scrapy-playwright
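
For reference, here is roughly what wiring it in looks like. This is a minimal sketch based on the scrapy-playwright README, not a full project config; check the README for the version you install.

```python
# settings.py (fragment), following the scrapy-playwright README

# Route HTTP and HTTPS downloads through the Playwright download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright needs the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

In the spider, a request opts in with `scrapy.Request(url, meta={"playwright": True})`; requests without that key go through Scrapy's regular downloader. Since this is just a download handler, Scrapy's own retry and throttling should keep applying, so you shouldn't need the second request.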

scrapy-playwright is good, and Playwright is awesome. However, due to Playwright's architecture, it just keeps accumulating memory until it crashes. You will want to set up your scraper to save its state regularly, shut down cleanly, and restart. Once you have that working, it does work well.
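
A rough sketch of the save-state-and-restart idea, in plain Python rather than anything Scrapy-specific. All names here (`load_state`, `save_state`, `run_batch`, the JSON layout) are made up for illustration, and `fetch` stands in for whatever actually drives the browser.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path):
    """Return the saved crawl state, or a fresh one if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"pending": [], "done": []}


def save_state(path, state):
    """Write the checkpoint atomically so a crash mid-write cannot corrupt it."""
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_batch(state, fetch, path, max_pages=100, save_every=10):
    """Process up to max_pages URLs, checkpointing along the way.

    Returns True when the queue is empty, False when the caller should
    restart the browser (or the whole process) and call run_batch again.
    """
    processed = 0
    while state["pending"] and processed < max_pages:
        url = state["pending"][0]
        fetch(url)  # hypothetical callable that renders and scrapes one page
        # Only move the URL to "done" once the fetch has succeeded.
        state["pending"].pop(0)
        state["done"].append(url)
        processed += 1
        if processed % save_every == 0:
            save_state(path, state)
    save_state(path, state)
    return not state["pending"]
```

The outer loop would call `run_batch`, tear down the browser when it returns False, then start a fresh one and continue from `load_state`. If you stay inside Scrapy, its `JOBDIR` setting (persistent crawl state) together with `CLOSESPIDER_PAGECOUNT` can probably get you something similar without custom code, though I haven't verified how that interacts with scrapy-playwright.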