HN Mirror


by simonw 1759 days ago

Disagree. I see scraping as about obtaining data in bulk that hasn't been deliberately packaged up for you to use as-is.

Most APIs are not designed to give you all of the data at once - they exist to serve other purposes, usually involving returning a small subset of the data to power a user-facing feature.

If someone asks me "where did you get those Olympic medal results?" and I say "I scraped them", I think that's accurate vocabulary whether I parsed HTML or gathered them from hundreds of undocumented API calls.

If I had downloaded a neat CSV file from the Olympics website with all of the data I needed in one go, I wouldn't feel comfortable calling it scraping.

Re-reading your comment, I think what I'm describing here does actually fit with your "how difficult it is to extract data from a format that was patently not intended to efficiently spread raw data to other machines" definition - except I'm including APIs that return only a subset of the data as part of those inefficiencies in obtaining the raw data.


addingnumbers 1759 days ago

APIs are rate-limited all the time. Compensating for being throttled or paginated isn't scraping.

Scraping is the act of extricating data from the layout and markup metadata meant to make it pretty for humans.

APIs generally don't include any of that, your HTML-in-a-JSON-object example notwithstanding.

I'd have no objection to calling it scraping when you strip those <P> tags, but aggregating the results of several API queries is bog-standard textbook API usage, which we use the term scraping to differentiate from.
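To make the two activities contrasted in this comment concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `FAKE_API_PAGES`, `fetch_page`, `fetch_all`, and `strip_markup` are hypothetical names, the records are made-up placeholders rather than real medal data, and the "API" is an in-memory dictionary rather than a real service.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for a paginated API: each page holds a few records
# plus the number of the next page (None on the last page). Placeholder data.
FAKE_API_PAGES = {
    1: {"results": [{"country": "A", "gold": 3}], "next": 2},
    2: {"results": [{"country": "B", "gold": 1}], "next": None},
}


def fetch_page(page):
    """Pretend API call; a real client would issue an HTTP request here."""
    return FAKE_API_PAGES[page]


def fetch_all():
    """Follow the pagination pointers and aggregate every record."""
    records = []
    page = 1
    while page is not None:
        payload = fetch_page(page)
        records.extend(payload["results"])
        page = payload["next"]
    return records


class TextExtractor(HTMLParser):
    """Collects non-empty text content and discards the tags around it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_markup(html):
    """Return the text fragments found in a piece of HTML."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

In this sketch `fetch_all()` is the "aggregate several API queries" case and `strip_markup("<p>Gold: <b>3</b></p>")` is the "strip the markup" case. Which of the two deserves the word scraping is exactly what the thread disagrees about.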


simonw 1758 days ago

I think we can at least agree that there is no formal definition of "scraping".

That said, I had a look around and the definitions I could find tended to support my interpretation:

https://en.wikipedia.org/wiki/Web_scraping - "Newer forms of web scraping involve monitoring data feeds from web servers. For example, JSON is commonly used as a transport storage mechanism between the client and the web server."

https://towardsdatascience.com/web-scraping-basics-82f8b5acd... - "There are 2 different approaches for web scraping depending on how does website structure their contents." (HTML scraping and API access)

https://realpython.com/beautiful-soup-web-scraper-python/ - "Web scraping is the process of gathering information from the Internet. Even copying and pasting the lyrics of your favorite song is a form of web scraping! However, the words “web scraping” usually refer to a process that involves automation"

The more formal dictionaries (Merriam-Webster and suchlike) don't seem to have formed an opinion on this one yet!


addingnumbers 1758 days ago

I think the definitions you're cherry-picking are examples of the erosion of the term's specificity. We need a word to describe scraping data values from a body of human-oriented markup. These folks are pushing the limits of ambiguity to rob us of that, and our reward is yet another word that just means using any API.

What is the upside of using this word in such an oddly vague, expansive way? What happens to those of us who need to convey the original specific meaning we coined it for in the first place?


simonw 1758 days ago

"We need a word to describe scraping data values from a body of human-oriented markup"

I call that parsing, which is one of the steps in scraping that may or may not be necessary depending on the data source.

I need a term that means "using automation to gather data from the web, when that data has not been published in a way that is suitable for my purposes". Scraping works great for that!
