Hacker News
by throwaway894345 2105 days ago
> the website is their core business

Granted, but there are lots and lots of ways they can break scrapers in the pursuit of their core business, such as a website redesign. For example, moving from static HTML to a web framework would require your scraper to actually run the JavaScript to generate the DOM in the state that a reader might view it in, and this is quite a lot more complicated than walking the static HTML.
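A minimal sketch of that difference, using only Python's standard library: walking static HTML finds the links, while the same walk over a JavaScript-rendered shell finds nothing. The `LinkCollector` class, the `extract_links` helper, and both sample pages are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags while walking static HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every link target found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical pages: one served as static HTML, one as an empty SPA shell
# whose content only appears after the browser runs /bundle.js.
STATIC_PAGE = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
SPA_SHELL = '<div id="root"></div><script src="/bundle.js"></script>'

static_links = extract_links(STATIC_PAGE)
spa_links = extract_links(SPA_SHELL)
```

Here `static_links` holds both targets, while `spa_links` is empty because the markup that would contain the links is never generated without executing the script.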

4 comments

It's not too complicated; you just need a headless browser. Having done a ton of web scraping projects, I’d recommend just starting with this approach, as even sites that look pretty static use JavaScript in subtle ways.
When it's an SPA, the data is usually embedded as JSON in the page or available from an internal API. Headless browsers use a huge amount of resources. When doing large-scale scraping, a headless browser should be a last resort.
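As a rough sketch of the embedded-JSON case, assuming the page ships its state in a `<script type="application/json">` tag: the tag id `page-data`, the sample markup, and the helper names below are all hypothetical, and real sites vary in where and how they embed this.

```python
import json
from html.parser import HTMLParser


class EmbeddedJsonParser(HTMLParser):
    """Captures the text of the <script> tag with a given id attribute."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        # Script bodies arrive here as raw text, not as parsed markup.
        if self._inside:
            self._chunks.append(data)

    def result(self):
        text = "".join(self._chunks).strip()
        return json.loads(text) if text else None


def extract_embedded_json(html, script_id):
    """Return the decoded JSON from the matching script tag, or None."""
    parser = EmbeddedJsonParser(script_id)
    parser.feed(html)
    parser.close()
    return parser.result()


# Hypothetical SPA shell: the visible DOM is empty, but the data is present.
PAGE = (
    '<div id="root"></div>'
    '<script id="page-data" type="application/json">'
    '{"items": [{"name": "widget", "price": 3}]}'
    "</script>"
)

data = extract_embedded_json(PAGE, "page-data")
```

No JavaScript runs here; the scraper reads the same payload the front end would have rendered.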
Using a headless browser for scraping is a lot slower and more resource-intensive than parsing HTML.
I don't find this a concern - in all the scraping I've done, the only bottleneck was the intentional throttling/rate limiting, not the speed or resources consumed by the headless browser; a small, cheap machine could easily process many, many times more requests than it would be reasonable to crawl.
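For a sense of what that intentional throttling looks like, here is a minimal sketch of a fixed-delay limiter. The `Throttle` class and the two-second interval are assumptions for illustration, not anything from a particular scraping library; the clock and sleep functions are injectable only so the behavior can be checked without waiting.

```python
import time


class Throttle:
    """Enforces a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now


# Simulated clock so the example finishes instantly.
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


throttle = Throttle(2.0, clock=lambda: fake_now[0], sleep=fake_sleep)
throttle.wait()      # first request goes out immediately
fake_now[0] += 0.5   # pretend fetching and rendering took 0.5 s
throttle.wait()      # second request waits out the rest of the interval
```

With a two-second interval and a half-second fetch, the crawler spends three quarters of each cycle idle, which is the sense in which the rate limit, not the browser, sets the pace.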
Sure, but it might be the only way to get the data.
It might be, but _starting_ a scraping project with a headless browser might be excessively expensive if you don't need the additional features.
"Only" is a bit of an overstatement. The data is always coming from somewhere; it just depends on how much effort is needed to reverse engineer the JavaScript code path to the data.
> For example, moving from static HTML to a web framework would require your scraper to actually run the JavaScript to generate the DOM in the state that a reader might view it in

Or, as is often the case, the content is already there or fetched via an API in far more easily-consumed JSON format that you can use directly.

That’s my point.

Granted, lots of APIs make it so prohibitively difficult to authenticate that it’s easier to simply scrape. Such is the case with just about every Microsoft product I’ve ever used, most recently the Xbox Live API. I genuinely wonder what kind of nonsense goes on in Microsoft design review meetings.

> moving from static HTML to a web framework

Looking at this sentence, I have the impression that nowadays it is taken for granted that "web framework" means "front-end web framework". I come from a time when it was perfectly fine to generate static HTML via a (server-side) web framework.

That's correct, I was referring to front-end web frameworks.
> this is quite a lot more complicated than walking the static HTML

Certainly more resource-intensive.