Hacker News
by hansvm 2116 days ago
I've found the opposite to be true -- when an entity is maintaining an API and their website with the same data, the website is their core business. The API is prone to being incomplete, buggy, subject to sudden deprecation, unreasonably rate limited (crippling access to some objects below what a casual human user has), and so on.

Conversely, overall document structure doesn't change much over time. I know it _can_; there's a social contract that APIs should change slowly while documents can change whenever, but that isn't what I observe in the wild. Even on fairly major redesigns, the overall structure has minimal edits.

A technique I've used before (wasted effort in hindsight since web pages are stable and I never have to update my scrapers) is to come up with several semantically different ways of accessing a piece of data on a page. It serves two purposes: you can recover from small page changes by having the different methods vote, and you can detect most kinds of page changes by noticing discrepancies, which tells you that the scraper needs to be updated soon.
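
A minimal sketch of that voting idea, using only the Python standard library. The three locators, the sample page, and the names `by_h1`, `by_og_title`, `by_title_tag`, and `extract_by_vote` are all assumptions made up for illustration; a real scraper would likely use a proper HTML library and locators suited to the target site.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class FirstH1(HTMLParser):
    """Collects the text of the first <h1> element."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.text is None:
            self.inside, self.text = True, ""

    def handle_endtag(self, tag):
        if tag == "h1":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


def by_h1(html):
    parser = FirstH1()
    parser.feed(html)
    return parser.text.strip() if parser.text else None


def by_og_title(html):
    m = re.search(r'<meta\s+property="og:title"\s+content="([^"]*)"', html)
    return m.group(1).strip() if m else None


def by_title_tag(html):
    # Hypothetical convention: the page title is "<name> | <site>".
    m = re.search(r"<title>(.*?)</title>", html, re.DOTALL)
    return m.group(1).split("|")[0].strip() if m else None


LOCATORS = {"h1": by_h1, "og_title": by_og_title, "title_tag": by_title_tag}


def extract_by_vote(html, locators=LOCATORS):
    """Return (winning value, names of locators that disagreed or failed)."""
    results = {}
    for name, locator in locators.items():
        try:
            results[name] = locator(html)
        except Exception:
            results[name] = None
    votes = Counter(v for v in results.values() if v)
    if not votes:
        return None, sorted(results)
    winner = votes.most_common(1)[0][0]
    dissenters = sorted(n for n, v in results.items() if v != winner)
    return winner, dissenters


PAGE = """<html><head><title>Blue Widget | Example Shop</title>
<meta property="og:title" content="Blue Widget"></head>
<body><h1>Blue Widget</h1></body></html>"""

# Simulate a small redesign that removes the <h1>.
REDESIGNED = PAGE.replace("<h1>Blue Widget</h1>", '<div class="name">Blue Widget</div>')

print(extract_by_vote(PAGE))
print(extract_by_vote(REDESIGNED))
```

A non-empty list of dissenters is the "update me soon" signal: the value is still recovered from the remaining locators, but one way of reaching it has stopped working.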

11 comments

> the website is their core business

Granted, but there are lots and lots of ways they can break scrapers in the pursuit of their core business, such as a website redesign. For example, moving from static HTML to a web framework would require your scraper to actually run the JavaScript to generate the DOM in the state that a reader might view it in, and this is quite a lot more complicated than walking the static HTML.

It's not too complicated, you just need a headless browser. Having done a ton of web scraping projects, I’d recommend just starting with this approach as even sites that look pretty static use JavaScript in subtle ways.
Data is usually embedded in JSON or available from an internal API when it's an SPA. Headless browsers use a lot of resources. When doing large-scale scraping, a headless browser should be a last resort.
Using a headless browser for scraping is a lot slower and more resource-intensive than parsing HTML.
I don't find this to be a concern - in all the scraping I've done, the only bottleneck was the intentional throttling/rate limiting, not the speed and resources spent by the headless browser; a small, cheap machine could easily process many, many times more requests than it would be reasonable to crawl.
Sure, but it might be the only way to get the data.
It might be, but _starting_ a scraping project with a headless browser might be excessively expensive if you don't need the additional features.
"only" is a bit of an overstatement. The data is always coming from somewhere; it just depends on how much effort is needed to reverse engineer the JavaScript code path to the data.
> For example, moving from static HTML to a web framework would require your scraper to actually run the JavaScript to generate the DOM in the state that a reader might view it in

Or, as is often the case, the content is already there or fetched via an API in far more easily-consumed JSON format that you can use directly.
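
A rough sketch of what "use the JSON directly" can look like when the data is embedded in the page. It assumes the site ships its state in a `<script type="application/json">` tag; the `__DATA__` id, the sample markup, and the `embedded_json` helper are hypothetical, and real sites vary in where and how they embed this.

```python
import json
import re

# Matches the body of any <script type="application/json"> tag.
SCRIPT_JSON = re.compile(
    r'<script[^>]*type="application/json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def embedded_json(html):
    """Return every JSON payload found in application/json script tags."""
    payloads = []
    for match in SCRIPT_JSON.finditer(html):
        try:
            payloads.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # not valid JSON after all; skip it
    return payloads


# Hypothetical SPA shell: an empty root element plus the embedded state.
SAMPLE = """<html><body><div id="root"></div>
<script id="__DATA__" type="application/json">
{"article": {"title": "Hello", "date": "2020-09-01"}}
</script></body></html>"""

data = embedded_json(SAMPLE)
print(data[0]["article"]["title"], data[0]["article"]["date"])
```

No JavaScript is executed here, which is the point: the DOM is empty, but the data is already in the response.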

That’s my point.

Granted, lots of APIs make it prohibitively difficult to authenticate such that it’s easier to simply scrape. Such is the case with just about every Microsoft product I’ve ever used, most recently the XBox Live API. I genuinely wonder what kind of nonsense goes on in Microsoft design review meetings.

> moving from static HTML to a web framework

Looking at this sentence, I have the impression that it is nowadays taken for granted that "web framework" means "front end web framework". I come from a time in which it was perfectly fine to generate static HTML via a (server-side) web framework.

That's correct, I was referring to front-end web frameworks.
> this is quite a lot more complicated than walking the static HTML

Certainly more resource-intensive.

I've found that the breaking point is for websites that consume their own public APIs. On those, the API is usually very well maintained, documented, and stable.

Those that don't use their own APIs almost always end up with an open API in the state you describe (except maybe the very big players like FB, where the open API is overall good).

That is actually a good criterion for code quality in general. Don’t prepare a Java method which is not used, because it will _never_ be right. Just implement today’s story, and leave the rest for another day. The same goes for rarely used functions: they are usually very buggy, whereas functions in the middle of everyone’s workflow are flawless. Hence the work of a good product owner is to streamline everyone as much as possible onto a few central functions. But of course, in an enterprise environment, there are a few functions that are required to work (XML export, backup and restore...)
It's _why_ that holds that interests me.

The best explanation I've come up with is that as a naive developer, it's impossible to know the nuances of any sufficiently complex process or workflow.

I think it really depends on the type of product that the business makes money on. If one of the main products is data then I'd wager their API will have significantly more information compared to their website. If they make money via the website then yes, they're less inclined to spend resources on the API.

All of our own websites are built on top of the same public API that everybody else uses, and scraping used to be a nuisance. It was also confusing because they would be able to get more data using the same free account just by using the API instead of scraping. Exactly like the OP mentioned, we only show a small number of properties via the website, but most scrapers never took the time to actually compare API vs website.

I think this depends entirely on what you are indexing. In my experience with some 100-ish scrapers for news sites, a few would break literally every day.

And the only thing we really wanted was article title and date.

My guess is that it depends if the API is seen as customer interface or implementation detail.

People are usually hesitant to constantly change how a customer interacts, but all too willing to change internal details.

About 15 years ago, I worked on an add-on for a large online service, using their official, versioned, documented API.

We had to build a test suite just to verify that the API was working as expected, because they would break it so often, and the documentation didn't always match reality.

This was a paid API we used on a pay-per-use basis, IIRC, and had official support for.

In the beginning, we had a false sense of security about the version numbers and such. The first couple of breaks seemed to be "just this one time". Then we realized that it was happening all the time, hence the test suite. (I was a junior, so I can't take credit for this work; I was just a witness.)
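
For a sense of what such a suite checks, here is a small sketch of a contract check against a response payload. It is not the suite described above; `check_contract`, `ORDER_CONTRACT`, and the field names are invented for illustration, and the payloads are hard-coded rather than fetched from a real service.

```python
def check_contract(payload, required):
    """Return a list of problems; an empty list means the payload conforms."""
    problems = []
    for field, expected_type in required.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(payload[field]).__name__}"
            )
    return problems


# Hypothetical documented shape of an "order" resource.
ORDER_CONTRACT = {"id": int, "status": str, "total_cents": int}

documented = {"id": 7, "status": "shipped", "total_cents": 1999}
# What the API might return after an unannounced change.
observed = {"id": "7", "status": "shipped"}

print(check_contract(documented, ORDER_CONTRACT))
print(check_contract(observed, ORDER_CONTRACT))
```

Run on a schedule against the live API, a check like this turns "the docs didn't match reality" into an alert instead of a production bug.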

An API is often no more reliable than scraping the human UI, with the added disadvantage of being of secondary importance.

Personally, I've tried to combine human UI with API as much as possible. For example, I added a feature for being able to post via direct URL entry, like so: http://example.com/hello+world

Most browsers will convert the spaces for you too, so you can just type your text into the address bar.
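
A sketch of how a server might read the message back out of such a URL. Treating `+` in the path as a space is a convention assumed here to match the example URL, not something URLs guarantee; `message_from_url` is a made-up helper name.

```python
from urllib.parse import quote, unquote_plus, urlsplit


def message_from_url(url):
    """Decode the path of a URL into the text the user typed."""
    path = urlsplit(url).path
    # unquote_plus handles both "%20" and "+" as spaces.
    return unquote_plus(path.lstrip("/"))


print(message_from_url("http://example.com/hello+world"))    # hello world
print(message_from_url("http://example.com/hello%20world"))  # hello world

# What a browser-style percent-encoding of the typed text looks like:
print(quote("hello world"))  # hello%20world
```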

Approaching the web as an end user, I have also found this to be true. Most websites rarely change their document structure in such a way that breaks simple text-editing scripts. Keyword: "Most". In most cases no specialised tools or libraries are needed for extracting text or other resources. Again, keyword: "most". Personally, just because there may be a few exceptions does not mean I am going to change a strategy that works almost 100% of the time.

Understanding "web APIs", which did not exist when I first started using the www in 1993, as anything other than a way to try to control and/or monetise scraping continues to escape me. I do like the increased usage of "endpoints" though, serving only data with no markup. That said, XML and JSON are too bloated compared to something sensible like netstrings.
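
For readers who have not met netstrings: the format is a decimal length, a colon, the raw bytes, and a trailing comma. A minimal encoder and decoder sketch follows; the function names are my own, and the decoder makes no attempt at the stricter checks (such as rejecting leading zeros) a production parser might want.

```python
def encode_netstring(payload: bytes) -> bytes:
    """Wrap payload as <length>:<payload>,"""
    return str(len(payload)).encode("ascii") + b":" + payload + b","


def decode_netstring(buf: bytes):
    """Decode one netstring from the front of buf; return (payload, rest)."""
    length_text, sep, rest = buf.partition(b":")
    if not sep or not length_text.isdigit():
        raise ValueError("missing or malformed length prefix")
    length = int(length_text)
    if len(rest) < length + 1 or rest[length:length + 1] != b",":
        raise ValueError("payload truncated or missing trailing comma")
    return rest[:length], rest[length + 1:]


wire = encode_netstring(b"hello world!")
print(wire)                    # b'12:hello world!,'
print(decode_netstring(wire))  # (b'hello world!', b'')
```

Because the length comes first, the reader never has to scan for delimiters or unescape anything, which is the "less bloated" property being pointed at.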

Similarly, on the client side, I fail to understand all the parsing tools and libraries and related promotion; it is just as easy to break any solution that depends on them and in many cases they are obviously overkill, more brittle than simple scripts using generalised text-editing tools.

One example is "jq". In many cases it is clearly overkill and is slower than sed.

https://stackoverflow.com/q/59806699

As a data source, the web is messy. "Standards" cannot be relied on 100%. Some people try to pretend the web is clean and can be tamed, or they "give up" because it is not "perfect" and things can break. Getting hands dirty works the best and most things do not break if kept simple, IME

I find that's the difference between public API endpoints and those APIs written for the page/app itself... if it's tightly coupled and the specs aren't published to the public for general use, I treat it as breakable.

That's just my own take. I've worked in environments with stronger versioning, and depending on your data needs and structures it can work. It's usually not worth it for most use cases though.

I've long wanted a really robust way of defining page areas for scraping, that could handle even relatively major HTML shifts.

My best idea has been to simply maintain a collection of "reference" URLs (e.g. of different products or articles) and identify unique start/end text for those specific instances.

Then automatically extract as many possible different "rules" for locating the desired content (pure structure and ordering, class hierarchies, classes/ids, surrounding text, etc.) and find the ones that are consistent across different instances.

And then just use those rules until they break on the reference page... and when they break, develop new ones.
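
A toy version of that loop, to make the idea concrete. It simplifies heavily: the candidate rules are hand-written regexes rather than automatically generated, the reference pages are inline strings rather than fetched URLs, and every name (`CANDIDATE_RULES`, `consistent_rules`, `REFERENCES`) is hypothetical.

```python
import re


def _first(pattern, html):
    """Return the first capture group of pattern in html, or None."""
    match = re.search(pattern, html, re.DOTALL)
    return match.group(1).strip() if match else None


# Different kinds of candidate rule: structure, class name, surrounding text.
CANDIDATE_RULES = {
    "h1_text": lambda html: _first(r"<h1[^>]*>(.*?)</h1>", html),
    "class_price": lambda html: _first(r'class="price"[^>]*>(.*?)<', html),
    "after_label": lambda html: _first(r"Price:\s*([^<\s]+)", html),
}


def consistent_rules(rules, references):
    """Keep only rules that return the expected text on every reference page."""
    return {
        name: rule
        for name, rule in rules.items()
        if all(rule(html) == expected for html, expected in references)
    }


# Reference pages paired with the content we know we want from each.
REFERENCES = [
    ('<div><h1>Mug</h1><p>Price: <span class="price">$8</span></p></div>', "$8"),
    ('<div><h1>Lamp</h1><span class="price">$30</span><p>Price: n/a</p></div>', "$30"),
]

active = consistent_rules(CANDIDATE_RULES, REFERENCES)
print(sorted(active))

# After a redesign, re-check the active rules against a reference page.
REDESIGNED = [('<div><h1>Mug</h1><b data-price>$8</b></div>', "$8")]
print(sorted(consistent_rules(active, REDESIGNED)))
```

When the second call comes back empty, the active rules have broken on the reference page and it is time to generate new candidates.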

I'm curious if anyone's built this type of thing?

I've seen a few academic papers and a few closed products that convert your selection of content you care about into a scraper capable of acquiring that content in the future. Last I checked there wasn't anything readily available as a FOSS library for doing so.

I'm having trouble finding those papers at the moment, but here are a couple commercial products that sound similar in spirit to what you're describing.

https://scraper.ai/

https://www.diffbot.com/ (kind of)

Edit: I hadn't searched recently enough. See the sibling comment recommending this library. Haven't used it yet, but at first glance it looks nice. https://github.com/alirezamika/autoscraper/

I've tried autoscraper now, and I don't like it (not yet).

(1) Its wrapper generation code isn't much more advanced than assuming that similar data will be similarly nested in similar parent blocks. It looks more brittle than I'd like.

(2) It has zero tests, comments, docstrings, types, or any other niceties so far (and minimal documentation).

(3) When things go wrong it strongly prefers returning no information and not throwing any errors. None of the examples in the README actually run (or rather, they give you a `None` response that's all but useless) without changes.

Despite being undocumented, it really works well. I tried the README examples and they all work. Maybe you didn't update the wanted list in the examples because it has changed on the page. IMO the biggest problem is the lack of js enabled content support.
> it really works well

It works well enough. I tried a few other sites and had mixed results even when providing the raw HTML so that I knew its HTTP logic wasn't the issue.

> maybe you didn't update the wanted list in the examples

Yeah, that was my only real problem with the readme examples. Those could just as easily be provided as local data (e.g. how `sklearn.datasets` works) so that the end user starts with working code, especially since there are no errors/warnings/etc when anything goes wrong.

> IMO the biggest problem is the lack of js enabled content support.

Haha, unless I'm seriously misunderstanding you this is one of the only things I don't mind :) Since you can pass raw HTML to the library, you can use your favorite headless browser to navigate (or in the happy case just load a non-interactive JS-enabled site) to your content and pass it through to this library to do the data extraction. I rather like those features being decoupled and kind of wish this library didn't attempt to do any of the crawling itself. I know that's just a personal preference, but it's my account, so I'll say what I like about it.

Didn't see that feature, makes sense now :)
This project was recently submitted to r/python: https://github.com/alirezamika/autoscraper/
I've had good luck just running this sort of thing as an offline process, especially for external dependencies. We used to get 'pink' for this lookup and now it's "<span>pink </span>" or somesuch.

It depends on the SLA, of course, but it's cheaper to check every few hours than on every request, and you get a couple of alerts instead of a constant stream of them.
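
A sketch of one such offline check, built around the 'pink' example. The `fetch` callable stands in for whatever performs the real lookup, and scheduling (cron or similar) and alert delivery are left out; `normalize` and `check_lookup` are names invented here.

```python
import re


def normalize(value):
    """Strip tags and collapse whitespace so cosmetic markup changes compare equal."""
    text = re.sub(r"<[^>]+>", "", value)
    return " ".join(text.split())


def check_lookup(fetch, expected):
    """Run one scheduled check; return None if fine, else an alert message."""
    raw = fetch()
    if raw == expected:
        return None
    if normalize(raw) == expected:
        return f"format drift: got {raw!r}, still normalizes to {expected!r}"
    return f"value changed: expected {expected!r}, got {raw!r}"


# Stand-ins for the external dependency before and after it changed.
print(check_lookup(lambda: "pink", "pink"))
print(check_lookup(lambda: "<span>pink </span>", "pink"))
print(check_lookup(lambda: "blue", "pink"))
```

Separating "format drift" from "value changed" keeps the alert actionable: the first usually means the parser needs a tweak, the second that the data itself moved.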

Interesting approach using multiple locators; I might use it in the future. Although in my limited experience, I agree that the average site that needs scraping keeps its "interface" mostly unchanged.

I guess a lot of the stuff I've needed to scrape is from old CMSes or sites where it's viewed as part of a cost center, so they're unlikely to invest in it and just maintain it.