|
I've found the opposite to be true -- when an entity maintains both an API and a website backed by the same data, the website is their core business. The API is prone to being incomplete, buggy, subject to sudden deprecation, unreasonably rate limited (restricting access to some objects below what a casual human user gets), and so on. Conversely, overall document structure doesn't change much over time. I know it _can_; there's a social contract that APIs should change slowly while documents can change whenever, but that isn't what I observe in the wild. Even in fairly major redesigns, the overall structure sees minimal edits. A technique I've used before (wasted effort in hindsight, since web pages are stable and I never have to update my scrapers) is to come up with several semantically different ways of accessing the same piece of data on a page. It serves two purposes: you can recover from small page changes by having the different methods vote, and you can detect most kinds of page changes by noticing discrepancies between them, which tells you the scraper needs to be updated soon. |
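To make the voting idea concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the sample page, the `price` field, and the names `by_id`, `by_meta`, `by_label`, and `vote` are assumptions, not anything from a real site or library. The regexes stand in for whatever selectors you would actually write with a proper HTML parser.

```python
import re
from collections import Counter

# Hypothetical page: the same price is reachable three unrelated ways.
PAGE = """
<meta itemprop="price" content="19.99">
<div><b>Price:</b> <span id="price">$19.99</span></div>
"""

# Same page after a small redesign: the id was replaced by a class.
PAGE_CHANGED = PAGE.replace('id="price"', 'class="amount"')


def by_id(html):
    """Strategy 1: the element carrying id="price"."""
    m = re.search(r'id="price"[^>]*>\s*\$?([\d.]+)', html)
    return m.group(1) if m else None


def by_meta(html):
    """Strategy 2: the machine-readable meta tag."""
    m = re.search(r'<meta\s+itemprop="price"\s+content="([\d.]+)"', html)
    return m.group(1) if m else None


def by_label(html):
    """Strategy 3: the number following the visible "Price:" label."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
    m = re.search(r"Price:\s*\$?([\d.]+)", text)
    return m.group(1) if m else None


def vote(html, extractors):
    """Return (winning_value, needs_update).

    winning_value is the most common non-None result (None if every
    extractor failed). needs_update is True whenever the extractors
    are not unanimous, i.e. at least one failed or disagreed.
    """
    results = [extract(html) for extract in extractors]
    counts = Counter(r for r in results if r is not None)
    if not counts:
        return None, True
    winner, votes = counts.most_common(1)[0]
    return winner, votes < len(results)


EXTRACTORS = [by_id, by_meta, by_label]

if __name__ == "__main__":
    print(vote(PAGE, EXTRACTORS))
    print(vote(PAGE_CHANGED, EXTRACTORS))
```

On the changed page the id-based strategy comes back empty, the other two still agree, and the flag tells you to go fix the scraper before a second change takes out the majority. With only three voters a tie is possible, so in practice you would probably want an odd number of strategies and a rule for what to do without a strict majority.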
Granted, but there are lots and lots of ways they can break scrapers in the pursuit of their core business, such as a website redesign. For example, if they move from static HTML to a client-side JavaScript framework, your scraper has to actually run the JavaScript to build the DOM in the state a reader would see it in, and that is quite a lot more complicated than walking static HTML.