by minimaxir 3740 days ago
I'm not fond of the implication at the end that scraping is justifiable because old websites are dinosaurs without APIs, and those websites are jerks for not doing so, and therefore scraping is the moral thing to do.

I've scraped my share of BuzzFeed data and Foursquare data to make data visualizations (with the latter explicitly saying "don't scrape" in their Terms). But if either one told me to stop and take down my results, I would not contest, since data is what drives the Internet ecosystem.

(For the record, neither service did; in fact, both tried to recruit me as a result of the visualizations. The difference is that I am not using the data to create a direct competitor that could cause them to lose business.)

4 comments

Disclaimer, I wrote the article.

> I'm not fond of the implication at the end that scraping is justifiable because old websites are dinosaurs without APIs, and those websites are jerks for not doing so, and therefore scraping is the moral thing to do.

It was not my intention to imply that. The main idea behind CityBikes is that public services should already provide this information since, well, they are public services. Along the same lines, a private company providing a public service should already do so. See motives [1].

> I've scraped my share of BuzzFeed data and Foursquare data to make data visualizations (with the latter explicitly saying "don't scrape" in their Terms). But if either one told me to stop and take down my results, I would not contest, since data is what drives the Internet ecosystem.

Same as CityBikes is doing. If we receive a cease and desist, we remove their service from our API. As for Foursquare, I do not see Foursquare as a public service. Your tax dollars at work, and all that.

I tried to keep the article balanced but maybe it wasn't clear. There are many transportation companies willing and happy to be scraped, or looking forward to providing their information for people to reuse [2].

[1]: https://blog.scrapinghub.com/2016/03/30/web-scraping-to-crea...

[2]: http://nabsa.net/current-members/

Why does your blog intentionally crash browsers that it thinks are Safari?
You appear to have replied to the wrong comment. ScrapingHub is not the site that attempts to crash Safari - that is weboob.
What do you mean by intentionally?
> For the record, neither service did; in fact, both tried to recruit me as a result of the visualizations. The difference is that I am not using the data to create a direct competitor that could cause them to lose business.

I do not understand that implication. How is providing bike share information creating a competitor? I can't run a bike sharing service.

Less a competitor, more a non-canonical source of information that they cannot manage.

If your offshoot were to misrepresent data, for example, then you would become a liability, even if you weren't making money.

Random tangent: your comment about unmanaged, non-canonical, misrepresented data reminds me of Zillow, who publish "facts" about real estate transactions with no mechanism for error correction. A few years ago they posted an erroneous sale of my house (a transaction that never took place), listing a sale price 20% lower than we'd paid for it. It directly harmed me when we later tried to sell the house: potential buyers cited Zillow's "estimates", which of course were artificially, drastically lower because of the phantom transaction. There was no avenue for recourse; angry tweets got a half-baked response from a junior social media person, but it was never resolved. I wonder how many others Zillow must have messed up.
That's a fair point. It is easily enforceable with a proper license. One example is the ETALAB open data license.
Why do you think scraping needs justifying in the first place?
Out of curiosity, where are the boundaries of your gray area when it comes to scraping?
If the service has an API with a fair rate limit (Foursquare does at 5000 requests/hour), I believe that is ok, since that implies their architecture is built for massive data requests. On the other hand, bypassing those rate limits with proxies is definitely bad.

If a website does not have an API (BuzzFeed), I take care to collect only the data that I need, not anything that would damage the business (e.g., entire articles). Consequently, I sanitize the data of such things if I decide to release the dataset.
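
For anyone wondering what staying under a rate limit like the 5000 requests/hour mentioned above could look like on the client side, here is a minimal sketch. It assumes the simplest possible strategy, spacing requests evenly; the `Throttle` class and its parameters are hypothetical names for illustration, not part of any Foursquare or BuzzFeed tooling.

```python
import time


class Throttle:
    """Space calls evenly so that no more than max_per_hour happen per hour.

    Hypothetical helper for illustration only; even spacing is an assumption,
    not something any particular API requires.
    """

    def __init__(self, max_per_hour, clock=time.monotonic, sleep=time.sleep):
        # Minimum gap between two consecutive requests, in seconds.
        self.min_interval = 3600.0 / max_per_hour
        self._clock = clock
        self._sleep = sleep
        # No request has been made yet, so the first one is always allowed.
        self._next_allowed = float("-inf")

    def wait(self):
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # The following request may go out one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return delay
```

Calling `throttle.wait()` before every request keeps a loop at or below the limit; at 5000 requests/hour that works out to one request every 0.72 seconds.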