by simonw
1753 days ago
If an unofficial API returns JSON that looks like this:

    {
      "id": 3422,
      "title": "My essay about cheese",
      "published": "13th August 2021 at 3:45pm",
      "abstract": "<p>In which I write about cheese!</p>"
    }
And I write code against that which includes stripping the HTML tags from "abstract" and converting the date format in "published" into an ISO datetime... am I writing scraping code?

I would argue that I am, even though it started out as a JSON wrapper.

"To call every query a scrape is to diminish the specificity and utility of the term."

Absolutely disagree with you there. I interpret the term "scraping" as "writing code that gathers data from a source that has not deliberately published that data in a usable format". Gathering data from any kind of API fits that criterion for me, since most APIs only give you a subset of the data at a time.

I think the reason I care so much about this is that I coined the term "git scraping" to cover a variant of scraping that uses Git repositories to store the data and track changes over time - and git scraping applies equally to data sourced from APIs as it does to data sourced from HTML pages.

https://simonwillison.net/2020/Oct/9/git-scraping/
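To make the cleanup step concrete, here is a minimal Python sketch of the kind of code I mean, using only the standard library. The function names (`strip_tags`, `parse_published`, `clean_record`) are made up for illustration, and the date pattern is an assumption based on the single example value above; a real feed could well use other formats. It also assumes English month names and produces a naive datetime, since the example carries no timezone.

```python
import re
from datetime import datetime
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def parse_published(text):
    """Convert e.g. '13th August 2021 at 3:45pm' to an ISO 8601 string.

    Assumption: every value follows the pattern of the one example above.
    """
    # Drop the ordinal suffix (st/nd/rd/th) that follows the day number.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text)
    return datetime.strptime(cleaned, "%d %B %Y at %I:%M%p").isoformat()


def clean_record(record):
    """Return a copy of the record with a plain-text abstract and ISO date."""
    cleaned = dict(record)
    cleaned["abstract"] = strip_tags(record["abstract"])
    cleaned["published"] = parse_published(record["published"])
    return cleaned
```

Calling `clean_record` on the JSON above (after `json.loads`) leaves "id" and "title" alone and rewrites the other two fields. Which is my point: the moment you need a helper like `parse_published`, you are doing the same work you would do against an HTML page.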
|
The term was coined to differentiate how difficult it is to extract data from a format that was patently not intended to efficiently spread raw data to other machines. If that meaning erodes, and it's just yet another way to say an API query, it will be a great loss for the precision of our terminology.