|
I admit being a bit disappointed that a well-known disadvantage of web scraping was not mentioned: web scraping is fragile! Web sites change, web frameworks evolve, and all it takes is some subtle reordering of a few <div>s or renaming of CSS classes for your perfect scraping code from yesterday to break tomorrow -- maybe not leaving you empty-handed, but probably missing some data or delivering the wrong data. If there is an API you can use, use it. If your budget allows you to pay for API access, buy it. APIs tend to be more stable than scraping, and the data provider will probably inform you if it changes. Contacting them might even get you more interesting data, as not every column they have in their database necessarily gets published on the web site. |
Conversely, overall document structure doesn't change much over time. I know it _can_; there's a social contract that APIs should change slowly while documents can change whenever, but that isn't what I observe in the wild. Even across fairly major redesigns, the overall structure sees only minimal edits.
A technique I've used before (wasted effort in hindsight, since web pages are stable and I never have to update my scrapers) is to come up with several semantically different ways of accessing a piece of data on a page. It serves two purposes: you can recover from small page changes by having the different methods vote, and you can detect most kinds of page changes by noticing discrepancies between the methods, which tells you the scraper needs to be updated soon.
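
A minimal sketch of the voting idea, in Python with only the standard library. Everything specific here is an assumption made up for illustration: the page markup, the `price` field, and the names `by_id`, `by_itemprop`, `by_label` and `extract_with_vote`. Regexes are used only to keep the example self-contained; a real scraper would more likely use a proper HTML parser for each access path.

```python
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple

Extractor = Callable[[str], Optional[str]]


def by_id(html: str) -> Optional[str]:
    # Path 1 (hypothetical): the element carrying id="price".
    m = re.search(r'id="price"[^>]*>\s*\$?([\d.]+)', html)
    return m.group(1) if m else None


def by_itemprop(html: str) -> Optional[str]:
    # Path 2 (hypothetical): structured metadata in a <meta> tag.
    m = re.search(r'itemprop="price"[^>]*content="([\d.]+)"', html)
    return m.group(1) if m else None


def by_label(html: str) -> Optional[str]:
    # Path 3 (hypothetical): the number following the visible "Price:" label.
    m = re.search(r'Price:\s*(?:<[^>]+>\s*)*\$?([\d.]+)', html)
    return m.group(1) if m else None


def extract_with_vote(
    html: str, extractors: List[Extractor]
) -> Tuple[Optional[str], bool]:
    """Return (value, needs_attention).

    value is the most common non-None answer. needs_attention is True
    whenever the extractors did not all agree, including when all failed.
    """
    results = []
    for extract in extractors:
        try:
            results.append(extract(html))
        except Exception:
            # A crashing extractor counts as a failed vote.
            results.append(None)

    votes = Counter(r for r in results if r is not None)
    if not votes:
        return None, True

    # Ties go to the value that was seen first.
    winner, count = votes.most_common(1)[0]
    return winner, count < len(extractors)


EXTRACTORS = [by_id, by_itemprop, by_label]

PAGE = """
<meta itemprop="price" content="19.99">
<div class="product">Price: <span id="price">$19.99</span></div>
"""

# Same page after a small redesign: the id is gone, a class was renamed.
CHANGED = """
<meta itemprop="price" content="19.99">
<div class="item">Price: <span class="amount">$19.99</span></div>
"""

print(extract_with_vote(PAGE, EXTRACTORS))     # ('19.99', False)
print(extract_with_vote(CHANGED, EXTRACTORS))  # ('19.99', True)
```

On the changed page one access path fails, but the other two still agree, so the value is recovered and the flag signals that the scraper should be looked at. How strict to be (simple plurality as here, or a required majority) is a design choice that depends on how costly a wrong value is.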