by addingnumbers 1753 days ago
APIs are rate-limited all the time. Compensating for being throttled or paginated isn't scraping.

Scraping is the act of extricating data from the layout and markup metadata meant to make it pretty for humans.

APIs generally don't include any of that, your HTML-in-a-JSON-object example notwithstanding.

I'd have no objection to calling it scraping when you strip those <P> tags, but aggregating the results of several API queries is bog-standard textbook API usage, which is exactly what the term "scraping" exists to distinguish itself from.
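To make the distinction concrete, here is a rough Python sketch. Everything in it is made up for illustration: `FAKE_API_PAGES`, `fetch_page`, and the `items`/`next` field names stand in for some paginated API, not any real one. Following the `next` pointers is plain API usage; only `strip_markup` does anything I'd call scraping.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a paginated API. Each page is already
# structured data; the "body" field mimics the HTML-in-a-JSON-object case.
FAKE_API_PAGES = {
    1: {"items": [{"id": 1, "body": "<p>Hello</p>"},
                  {"id": 2, "body": "<p>World</p>"}], "next": 2},
    2: {"items": [{"id": 3, "body": "<p>Again</p>"}], "next": None},
}


def fetch_page(page):
    """Pretend network call; a real client would also handle rate limits."""
    return FAKE_API_PAGES[page]


def aggregate_all_pages():
    """Follow the pagination pointers and collect every item (API usage)."""
    items = []
    page = 1
    while page is not None:
        response = fetch_page(page)
        items.extend(response["items"])
        page = response["next"]
    return items


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(html):
    """Pull the data value out of human-oriented markup (scraping)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)
```

Calling `aggregate_all_pages()` never touches markup at all; `strip_markup(item["body"])` is the only step that extricates data from presentation.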


I think we can at least agree that there is no formal definition of "scraping".

That said, I had a look around and the definitions I could find tended to support my interpretation:

https://en.wikipedia.org/wiki/Web_scraping - "Newer forms of web scraping involve monitoring data feeds from web servers. For example, JSON is commonly used as a transport storage mechanism between the client and the web server."

https://towardsdatascience.com/web-scraping-basics-82f8b5acd... - "There are 2 different approaches for web scraping depending on how does website structure their contents." (HTML scraping and API access)

https://realpython.com/beautiful-soup-web-scraper-python/ - "Web scraping is the process of gathering information from the Internet. Even copying and pasting the lyrics of your favorite song is a form of web scraping! However, the words “web scraping” usually refer to a process that involves automation"

The more formal dictionaries (Merriam Webster and suchlike) don't seem to have formed an opinion on this one yet!

I think the definitions you're cherry-picking are examples of the erosion of the term's specificity. We need a word to describe scraping data values from a body of human-oriented markup. These folks are pushing the limits of ambiguity to rob us of that, and our reward is yet another word that just means using any API.

What is the upside of using this word in such an oddly vague, expansive way? What happens to those of us who need to convey the original specific meaning we coined it for in the first place?

"We need a word to describe scraping data values from a body of human-oriented markup"

I call that parsing, which is one of the steps in scraping that may or may not be necessary depending on the data source.

I need a term that means "using automation to gather data from the web, when that data has not been published in a way that is suitable for my purposes". Scraping works great for that!
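A minimal Python sketch of what I mean by parsing being an optional step. The `gather` function, the `values` key, and the `<li>` layout are all assumptions invented for this example; the point is only that the same gathering job sometimes needs a markup-parsing step and sometimes doesn't.

```python
import json
from html.parser import HTMLParser


class ListItemParser(HTMLParser):
    """Collects the text inside each <li> element."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.values[-1] += data


def gather(raw, content_type):
    """Return the list of values from a payload that was already fetched.

    The payload shapes are hypothetical: JSON with a "values" key,
    or HTML with one value per <li>.
    """
    if content_type == "application/json":
        # Structured source: no markup-parsing step needed.
        return json.loads(raw)["values"]
    if content_type == "text/html":
        # Human-oriented source: the parsing step is required.
        parser = ListItemParser()
        parser.feed(raw)
        parser.close()
        return parser.values
    raise ValueError(f"unsupported content type: {content_type}")
```

Either branch hands back the same list, so the caller doesn't care which kind of source the data came from.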