Hacker News
by ugjka 1753 days ago
For a project like youtube-dl it is a long time, because they use unofficial APIs (fancy word for scraping) of video sites that can shift even on a daily basis. If you look at their GitHub issues, it is just people endlessly complaining that some websites are broken again.

> unofficial APIs (fancy word for scraping)

Using a non-public API is not at all the same as scraping, which refers to parsing a rendered HTML page for the content you want.

Both have this maintenance problem, but one's not a fancy word for the other.

That used to be true, but today, with so many websites operating as SPAs against undocumented APIs, I think it's reasonable to redefine "scraping" to mean extracting data from unofficial APIs in addition to extracting it by parsing HTML.

After all, what is a scrapeable HTML page if not a grotesquely convoluted undocumented API with an unstable output format?

Scraping refers specifically to extracting data from a format designed to be read by humans instead of machines.

The gross inefficiency and low data-to-layout ratio are the key things being expressed through connotations of the word "scrape". To scrape is to extract a small amount of something from a much larger substrate.

To call every query a scrape is to diminish the specificity and utility of the term.

If an unofficial API returns JSON that looks like this:

    {
      "id": 3422,
      "title": "My essay about cheese",
      "published": "13th August 2021 at 3:45pm",
      "abstract": "<p>In which I write about cheese!</p>"
    }

And I write code against that which includes stripping the HTML tags from "abstract" and converting the date format in "published" into an ISO datetime... am I writing scraping code?

I would argue that I am, even though it started out as a JSON wrapper.
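
A minimal Python sketch of the cleanup described above, using only the standard library. The helper names (`strip_tags`, `parse_published`, `normalize`) are made up for illustration, the date pattern is assumed to always look like the sample, and the result is a naive datetime because the sample carries no timezone:

```python
import re
from datetime import datetime
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


def parse_published(text):
    """Parse a date like '13th August 2021 at 3:45pm' (assumed format)."""
    # Drop the ordinal suffix that follows the day number: 13th -> 13
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text)
    # %B relies on English month names, i.e. the default C locale
    return datetime.strptime(cleaned, "%d %B %Y at %I:%M%p")


def normalize(record):
    """Turn the semi-structured API record into clean fields."""
    return {
        "id": record["id"],
        "title": record["title"],
        "published": parse_published(record["published"]).isoformat(),
        "abstract": strip_tags(record["abstract"]),
    }
```

Even with JSON as the transport, two of the four fields still need human-oriented formatting peeled off before they are usable as data.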

"To call every query a scrape is to diminish the specificity and utility of the term."

Absolutely disagree with you there. I interpret the term "scraping" as "writing code that gathers data from a source that has not deliberately published that data in a usable format". Gathering data from any kind of API fits that criterion for me, since most APIs only give you a subset of the data at a time.

I think the reason I care so much about this is that I coined the term "git scraping" to cover a variant of scraping that uses Git repositories to store the data and track changes over time - and git scraping applies equally to data sourced from APIs as it does to data sourced from HTML pages. https://simonwillison.net/2020/Oct/9/git-scraping/
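
A rough Python sketch of the git scraping idea, not taken from the linked post: write the fetched data to a file in a stable format, then commit it only when it changed, so the Git history becomes the change log. The function names are hypothetical, and `commit_snapshot` assumes `git` is installed and `repo_dir` is already a repository:

```python
import json
import subprocess
from pathlib import Path


def write_snapshot(data, path):
    """Write data as stable, diff-friendly JSON; return True if the file changed."""
    path = Path(path)
    # Sorted keys and fixed indentation keep diffs limited to real data changes
    new_text = json.dumps(data, indent=2, sort_keys=True) + "\n"
    old_text = path.read_text(encoding="utf-8") if path.exists() else None
    if new_text == old_text:
        return False
    path.write_text(new_text, encoding="utf-8")
    return True


def commit_snapshot(repo_dir, filename, message):
    """Stage and commit one file (assumes repo_dir is an existing Git repo)."""
    subprocess.run(["git", "add", filename], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo_dir, check=True)
```

Nothing here cares whether `data` came from parsed HTML or from an API response, which is the sense in which the technique applies to both.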

If everyone insists that is what it means for long enough, then that is what it will mean.

The term was coined to differentiate how difficult it is to extract data from a format that was patently not intended to efficiently spread raw data to other machines. If that meaning erodes, and it's just yet another way to say an API query, it will be a great loss for the precision of our terminology.

Disagree. I see scraping as about obtaining data in bulk that hasn't been deliberately packaged up for you to use as-is.

Most APIs are not designed to give you all of the data at once - they exist to serve other purposes, usually involving returning a small subset of the data to power a user-facing feature.

If someone asks me "where did you get those Olympic medal results?" and I say "I scraped them" I think that's accurate vocabulary whether I parsed HTML or gathered them from hundreds of undocumented API calls.

If I had downloaded a neat CSV file from the Olympics website with all of the data I needed in one go I wouldn't feel comfortable calling it scraping.

Re-reading your comment, I think what I'm describing here does actually fit with your "how difficult it is to extract data from a format that was patently not intended to efficiently spread raw data to other machines" definition - except I'm including APIs that return only a subset of the data as part of those inefficiencies in obtaining the raw data.
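
To make the "subset at a time" point concrete, here is a small Python sketch of collecting a full dataset from an API that only hands out one page per call. The `fetch_page` callable and the `results` / `next_page` field names are assumptions for illustration, not any real site's API:

```python
def gather_all(fetch_page, first_page=1, max_pages=1000):
    """Collect every record from an API that returns one page at a time.

    fetch_page is a hypothetical callable: fetch_page(n) returns a dict
    like {"results": [...], "next_page": <int or None>}.
    """
    records = []
    page = first_page
    # max_pages is a safety cap in case the API never reports a last page
    for _ in range(max_pages):
        payload = fetch_page(page)
        records.extend(payload["results"])
        page = payload.get("next_page")
        if page is None:
            break
    return records
```

In practice `fetch_page` would wrap an HTTP request; passing it in as a parameter keeps the loop testable without a network.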

So you get a video feed in the end, that is viewed by a human, presumably?

What you do with the data after you have it is not a qualification for whether or not you got it by scraping.

Sounds like an exhausting thing to maintain. It's not like writing scraping code (or even just keeping up with slight variations in an API) is terribly interesting.

True, but it’s also the kind of product that’s instantly useful and itch-scratchy. Youtube-dl not working on the video you’re downloading today? Well, if you’re a maintainer you can just patch it yourself. (Non-maintainers can too, of course, but I imagine the maintainers have the know-how to actually fix things.)