HN Mirror


	by waprin 2152 days ago
	It's not too complicated, you just need a headless browser. Having done a ton of web scraping projects, I’d recommend just starting with this approach, as even sites that look pretty static use JavaScript in subtle ways.

2 comments

lyjackal 2151 days ago

When it's an SPA, the data is usually embedded as JSON in the page or available from an internal API. Headless browsers are pretty resource-hungry. When doing large-scale scraping, a headless browser should be a last resort.
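
To make the "embedded in JSON" case concrete, here is a minimal sketch using only the Python standard library. It assumes the page ships its data in a `<script type="application/json">` tag; the sample markup, the `__DATA__` id, and the `products` field are made up for illustration, and real sites vary a lot in how they embed state.

```python
import json
from html.parser import HTMLParser


class EmbeddedJSONParser(HTMLParser):
    """Collects the parsed contents of <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a JSON script tag opens.
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script bodies can arrive in several chunks, so accumulate them.
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Parse the buffered text once the script tag closes.
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            self.payloads.append(json.loads("".join(self._buffer)))


def extract_embedded_json(html):
    """Return a list with one parsed object per JSON script tag in the page."""
    parser = EmbeddedJSONParser()
    parser.feed(html)
    parser.close()
    return parser.payloads


# Hypothetical page: the markup and field names are assumptions for this sketch.
SAMPLE_HTML = """
<html><body>
<div id="root"></div>
<script id="__DATA__" type="application/json">
{"products": [{"name": "widget", "price": 9.5}]}
</script>
</body></html>
"""

payloads = extract_embedded_json(SAMPLE_HTML)
```

In practice you would fetch the HTML with a plain HTTP client instead of using `SAMPLE_HTML`, which is where the savings over a headless browser come from.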

hermanradtke 2151 days ago

Using a headless browser for scraping is a lot slower and more resource-intensive than parsing HTML.

PeterisP 2151 days ago

I don't find this a concern. In all the scraping I've done, the only bottleneck was the intentional throttling/rate limiting, not the speed and resources spent by the headless browser; a small, cheap machine could easily process many, many times more requests than it would be reasonable to crawl.
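
For reference, the kind of intentional throttling meant here can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Throttle` class and its `min_interval` parameter are hypothetical names, and a real crawler would more likely use per-host limits and honor `robots.txt`.

```python
import time


class Throttle:
    """Enforces a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` before each request caps the crawl at one request per `min_interval` seconds. At a polite rate like that, the time spent rendering in a headless browser is usually hidden behind the wait, which is the point being made above.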

sullyj3 2151 days ago

Sure, but it might be the only way to get the data.

hansvm 2151 days ago

It might be, but _starting_ a scraping project with a headless browser might be excessively expensive if you don't need the additional features.

lyjackal 2151 days ago

"Only" is a bit of an overstatement. The data is always coming from somewhere; it just depends on how much effort is needed to reverse engineer the JavaScript code path to the data.
