by waprin 2105 days ago
It's not too complicated, you just need a headless browser. Having done a ton of web scraping projects, I'd recommend just starting with this approach, as even sites that look pretty static use JavaScript in subtle ways.
2 comments

When it's an SPA, the data is usually embedded in JSON in the page or available from an internal API. Headless browsers use a huge amount of resources. When doing large-scale scraping, a headless browser should be a last resort.
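To illustrate the embedded-JSON case, here is a minimal sketch using only the Python standard library. It assumes the page ships its data in a `<script type="application/json">` tag; the sample markup, the `__DATA__` id, and the `products` field are made up for the example, and real sites vary (some assign the data to a JavaScript variable instead, which needs different handling).

```python
import json
from html.parser import HTMLParser


class EmbeddedJSONParser(HTMLParser):
    """Collect the parsed contents of <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON script tag opens.
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so buffer it.
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # On the closing tag, join the buffered text and parse it as JSON.
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            self.payloads.append(json.loads("".join(self._chunks)))


def extract_embedded_json(html):
    """Return a list of JSON payloads embedded in the given HTML string."""
    parser = EmbeddedJSONParser()
    parser.feed(html)
    parser.close()
    return parser.payloads


# Hypothetical page: the id and the data shape are assumptions for illustration.
SAMPLE_HTML = """
<html><body>
<div id="app"></div>
<script id="__DATA__" type="application/json">
{"products": [{"name": "Widget", "price": 9.99}]}
</script>
</body></html>
"""

if __name__ == "__main__":
    for payload in extract_embedded_json(SAMPLE_HTML):
        print(payload["products"][0]["name"])
```

No browser is involved: the HTML is fetched once by whatever HTTP client you already use, and the data comes out as ordinary Python dicts and lists.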
Using a headless browser for scraping is a lot slower and more resource-intensive than parsing HTML.
I don't find this a concern. In all the scraping I've done, the only bottleneck was the intentional throttling/rate limiting, not the speed or resources of the headless browser; a small, cheap machine could easily process many, many times more requests than it would be reasonable to crawl.
Sure, but it might be the only way to get the data.
It might be, but _starting_ a scraping project with a headless browser might be excessively expensive if you don't need the additional features.
"Only" is a bit of an overstatement. The data is always coming from somewhere; it just depends on how much effort is needed to reverse engineer the JavaScript code path to the data.