by OJFord 1753 days ago
> unofficial APIs (fancy word for scraping)

Using a non-public API is not at all the same as scraping, which refers to parsing a rendered HTML page for the content you want.

Both have this maintenance problem, but one's not a fancy word for the other.

That used to be true, but today, with so many websites operating as SPAs against undocumented APIs, I think it's reasonable to redefine "scraping" to mean extracting data from unofficial APIs in addition to extracting it by parsing HTML.

After all, what is a scrapeable HTML page if not a grotesquely convoluted undocumented API with an unstable output format?

Scraping refers specifically to extracting data from a format designed to be read by humans instead of machines.

The gross inefficiency and low data-to-layout ratio are the key things being expressed through connotations of the word "scrape". To scrape is to extract a small amount of something from a much larger substrate.

To call every query a scrape is to diminish the specificity and utility of the term.

If an unofficial API returns JSON that looks like this:

    {
      "id": 3422,
      "title": "My essay about cheese",
      "published": "13th August 2021 at 3:45pm",
      "abstract": "<p>In which I write about cheese!</p>"
    }

And I write code against that, which includes stripping the HTML tags from "abstract" and converting the date format in "published" into an ISO datetime... am I writing scraping code?

I would argue that I am, even though it started out as a JSON wrapper.
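To make that concrete, here is a rough sketch of the kind of cleanup code in question, using only the Python standard library. The function names (`strip_tags`, `parse_published`, `clean_record`) are made up for illustration, the date pattern is inferred from the single example value above, and since that value carries no timezone the result is a naive datetime. It also assumes an English locale for the month name.

```python
import re
from datetime import datetime
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment, with the tags dropped."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


def parse_published(text):
    """Parse a string like '13th August 2021 at 3:45pm' (assumed format)."""
    # Drop the ordinal suffix (st/nd/rd/th) so strptime can read the day.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text)
    return datetime.strptime(cleaned, "%d %B %Y at %I:%M%p")


def clean_record(record):
    """Return a copy of the record with plain-text abstract and ISO date."""
    cleaned = dict(record)
    cleaned["abstract"] = strip_tags(record["abstract"])
    cleaned["published"] = parse_published(record["published"]).isoformat()
    return cleaned


record = {
    "id": 3422,
    "title": "My essay about cheese",
    "published": "13th August 2021 at 3:45pm",
    "abstract": "<p>In which I write about cheese!</p>",
}
print(clean_record(record))
```

Both helpers are there only to undo formatting that was meant for human readers, which is the part that feels like scraping to me.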

"To call every query a scrape is to diminish the specificity and utility of the term."

Absolutely disagree with you there. I interpret the term "scraping" as "writing code that gathers data from a source that has not deliberately published that data in a usable format". Gathering data from any kind of API fits that criterion for me, since most APIs only give you a subset of the data at a time.

I think the reason I care so much about this is that I coined the term "git scraping" to cover a variant of scraping that uses Git repositories to store the data and track changes over time - and git scraping applies equally to data sourced from APIs as it does to data sourced from HTML pages. https://simonwillison.net/2020/Oct/9/git-scraping/

If everyone insists that is what it means for long enough, then that is what it will mean.

The term was coined to differentiate how difficult it is to extract data from a format that was patently not intended to efficiently spread raw data to other machines. If that meaning erodes, and it's just yet another way to say an API query, it will be a great loss for the precision of our terminology.

Disagree. I see scraping as being about obtaining data in bulk that hasn't been deliberately packaged up for you to use as-is.

Most APIs are not designed to give you all of the data at once - they exist to serve other purposes, usually involving returning a small subset of the data to power a user-facing feature.

If someone asks me "where did you get those Olympic medal results?" and I say "I scraped them" I think that's accurate vocabulary whether I parsed HTML or gathered them from hundreds of undocumented API calls.

If I had downloaded a neat CSV file from the Olympics website with all of the data I needed in one go I wouldn't feel comfortable calling it scraping.

Re-reading your comment, I think what I'm describing here does actually fit with your "how difficult it is to extract data from a format that was patently not intended to efficiently spread raw data to other machines" definition - except I'm including APIs that return only a subset of the data as part of those inefficiencies in obtaining the raw data.

APIs are rate-limited all the time. Compensating for being throttled or paginated isn't scraping.

Scraping is the act of extricating data from the layout and markup metadata meant to make it pretty for humans.

APIs generally don't include any of that, your HTML-in-a-JSON-object example notwithstanding.

I'd have no objection to calling it scraping when you strip those <p> tags, but aggregating the results of several API queries is bog-standard textbook API usage, which is exactly what we use the term scraping to differentiate it from.
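For reference, the kind of aggregation being argued over usually comes down to a loop like the sketch below. Everything in it is hypothetical: `fetch_page` stands in for whatever call retrieves one page, the `"results"` and `"next_page"` keys are an assumed response shape, and the in-memory `FAKE_PAGES` replaces a real network API so the example runs on its own.

```python
from typing import Callable, Iterator, Optional


def iter_all_results(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every item from a paginated source, one page at a time.

    Assumes each page is a dict with a "results" list and a "next_page"
    number that is None (or missing) on the last page.
    """
    page: Optional[int] = 1
    while page is not None:
        payload = fetch_page(page)
        yield from payload["results"]
        page = payload.get("next_page")


# Stand-in for a real API: two pages of made-up records.
FAKE_PAGES = {
    1: {"results": [{"id": 1}, {"id": 2}], "next_page": 2},
    2: {"results": [{"id": 3}], "next_page": None},
}


def fake_fetch_page(page: int) -> dict:
    """Hypothetical fetcher that reads from FAKE_PAGES instead of the network."""
    return FAKE_PAGES[page]


all_items = list(iter_all_results(fake_fetch_page))
print(all_items)
```

Nothing in the loop touches markup or layout; it only follows the page pointer. Whether that counts as scraping is the disagreement in this thread.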

So you get a video feed in the end, which is presumably viewed by a human?

What you do with the data after you have it is not a qualification for whether or not you got it by scraping.