by simonw 1800 days ago

If an unofficial API returns JSON that looks like this:

    {
      "id": 3422,
      "title": "My essay about cheese",
      "published": "13th August 2021 at 3:45pm",
      "abstract": "<p>In which I write about cheese!</p>"
    }

And I write code against that which includes stripping the HTML tags from "abstract" and converting the date format in "published" into an ISO datetime... am I writing scraping code?

I would argue that I am, even though it started out as a JSON wrapper.
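To make that concrete, here is a rough sketch of the cleanup code I have in mind, using only the Python standard library. The function names (`strip_tags`, `parse_published`, `clean_record`) are made up for illustration, and it assumes the date always follows the "13th August 2021 at 3:45pm" pattern with English month names; a real unofficial API would probably need more defensive handling.

```python
import re
from datetime import datetime
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, discarding tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


def parse_published(text):
    """Convert e.g. '13th August 2021 at 3:45pm' to an ISO 8601 string.

    Assumption: the format never varies and month names are in English.
    """
    # Drop the ordinal suffix (st, nd, rd, th) that strptime cannot parse.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text)
    return datetime.strptime(cleaned, "%d %B %Y at %I:%M%p").isoformat()


def clean_record(record):
    """Return a copy of the record with a plain-text abstract and ISO date."""
    cleaned = dict(record)
    cleaned["abstract"] = strip_tags(record["abstract"])
    cleaned["published"] = parse_published(record["published"])
    return cleaned


if __name__ == "__main__":
    print(clean_record({
        "id": 3422,
        "title": "My essay about cheese",
        "published": "13th August 2021 at 3:45pm",
        "abstract": "<p>In which I write about cheese!</p>",
    }))
```

Note that the resulting datetime is naive: the JSON carries no timezone, so the sketch does not invent one.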

"To call every query a scrape is to diminish the specificity and utility of the term."

Absolutely disagree with you there. I interpret the term "scraping" as "writing code that gathers data from a source that has not deliberately published that data in a usable format". Gathering data from any kind of API fits that criterion for me, since most APIs only give you a subset of the data at a time.

I think the reason I care so much about this is that I coined the term "git scraping" to cover a variant of scraping that uses Git repositories to store the data and track changes over time - and git scraping applies equally to data sourced from APIs as it does to data sourced from HTML pages. https://simonwillison.net/2020/Oct/9/git-scraping/

addingnumbers 1800 days ago

If everyone insists that is what it means for long enough, then that is what it will mean.

The term was coined to differentiate how difficult it is to extract data from a format that was patently not intended to efficiently spread raw data to other machines. If that meaning erodes, and it's just yet another way to say an API query, it will be a great loss for the precision of our terminology.

simonw 1800 days ago

Disagree. I see scraping as about obtaining data in bulk that hasn't been deliberately packaged up for you to use as-is.

Most APIs are not designed to give you all of the data at once - they exist to serve other purposes, usually involving returning a small subset of the data to power a user-facing feature.

If someone asks me "where did you get those Olympic medal results?" and I say "I scraped them" I think that's accurate vocabulary whether I parsed HTML or gathered them from hundreds of undocumented API calls.

If I had downloaded a neat CSV file from the Olympics website with all of the data I needed in one go I wouldn't feel comfortable calling it scraping.

Re-reading your comment, I think what I'm describing here does actually fit with your "how difficult it is to extract data from a format that was patently not intended to efficiently spread raw data to other machines" definition - except I'm including APIs that return only a subset of the data as part of those inefficiencies in obtaining the raw data.
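As an illustration of the "subset at a time" point, this is roughly what the gathering loop looks like. It is only a sketch: `fetch_page` is a hypothetical callable standing in for whatever undocumented endpoint is being hit, and the assumption that an empty page signals the end of the data will not hold for every API.

```python
from typing import Callable, Dict, Iterator, List


def gather_all(
    fetch_page: Callable[[int], List[Dict]],
    max_pages: int = 1000,
) -> Iterator[Dict]:
    """Yield every record from a paginated source, one page at a time.

    fetch_page(n) is a hypothetical function returning the records on
    page n (1-based), or an empty list once the pages run out.
    max_pages is a safety cap so a misbehaving source cannot loop forever.
    """
    for page in range(1, max_pages + 1):
        rows = fetch_page(page)
        if not rows:
            return  # assumed end-of-data signal
        yield from rows


if __name__ == "__main__":
    # Stand-in data; a real fetch_page would make an HTTP request.
    fake_pages = {
        1: [{"country": "A", "gold": 3}],
        2: [{"country": "B", "gold": 1}],
    }
    for row in gather_all(lambda n: fake_pages.get(n, [])):
        print(row)
```

Passing the fetcher in as a parameter keeps the loop independent of any particular HTTP client, and makes it easy to add throttling inside `fetch_page` if the source is rate-limited.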

addingnumbers 1800 days ago

APIs are rate-limited all the time. Compensating for being throttled or paginated isn't scraping.

Scraping is the act of extricating data from the layout and markup metadata meant to make it pretty for humans.

APIs generally don't include any of that, your HTML-in-a-JSON-object example notwithstanding.

I'd have no objection to calling it scraping when you strip those <P> tags, but aggregating the results of several API queries is bog-standard textbook API usage, which we use the term scraping to differentiate from.

simonw 1800 days ago

I think we can at least agree that there is no formal definition of "scraping".

That said, I had a look around and the definitions I could find tended to support my interpretation:

https://en.wikipedia.org/wiki/Web_scraping - "Newer forms of web scraping involve monitoring data feeds from web servers. For example, JSON is commonly used as a transport storage mechanism between the client and the web server."

https://towardsdatascience.com/web-scraping-basics-82f8b5acd... - "There are 2 different approaches for web scraping depending on how does website structure their contents." (HTML scraping and API access)

https://realpython.com/beautiful-soup-web-scraper-python/ - "Web scraping is the process of gathering information from the Internet. Even copying and pasting the lyrics of your favorite song is a form of web scraping! However, the words “web scraping” usually refer to a process that involves automation"

The more formal dictionaries (Merriam Webster and suchlike) don't seem to have formed an opinion on this one yet!

addingnumbers 1800 days ago

I think the definitions you're cherry-picking are examples of the erosion of the term's specificity. We need a word to describe scraping data values from a body of human-oriented markup. These folks are pushing the limits of ambiguity to rob us of that, and our reward is yet another word that just means using any API.

What is the upside of using this word in such an oddly vague, expansive way? What happens to those of us who need to convey the original specific meaning we coined it for in the first place?

simonw 1800 days ago

"We need a word to describe scraping data values from a body of human-oriented markup"

I call that parsing, which is one of the steps in scraping that may or may not be necessary depending on the data source.

I need a term that means "using automation to gather data from the web, when that data has not been published in a way that is suitable for my purposes". Scraping works great for that!
