| HN Mirror

by minimaxir 3740 days ago

I'm not fond of the implication at the end that scraping is justifiable because old websites are dinosaurs without APIs, and those websites are jerks for not doing so, and therefore scraping is the moral thing to do.

I've scraped my share of BuzzFeed data and Foursquare data to make data visualizations (with the latter explicitly saying "don't scrape" in their Terms). But if either one told me to stop and take down my results, I would not contest, since data is what drives the Internet ecosystem.

(For the record, neither service did; in fact, both tried to recruit me as a result of the visualizations. The difference is that I am not using the data to create a direct competitor that could cause them to lose business.)

4 comments

kh_hk 3740 days ago

Disclaimer, I wrote the article.

> I'm not fond of the implication at the end that scraping is justifiable because old websites are dinosaurs without APIs, and those websites are jerks for not doing so, and therefore scraping is the moral thing to do.

It was not my intention to imply that. The main idea behind CityBikes is that public services should already provide this information since, well, they are public services. Along the same lines, a private company providing a public service should do so too. See motives [1].

> I've scraped my share of BuzzFeed data and Foursquare data to make data visualizations (with the latter explicitly saying "don't scrape" in their Terms). But if either one told me to stop and take down my results, I would not contest, since data is what drives the Internet ecosystem.

Same as CityBikes is doing. If we receive a cease and desist, we remove their service from our API. As for Foursquare, I do not see Foursquare as a public service. Your tax dollars at work, and all that.

I tried to keep the article balanced, but maybe it wasn't clear. There are many transportation companies willing and happy to be scraped, or looking forward to providing their information for people to reuse [2].

[1]: https://blog.scrapinghub.com/2016/03/30/web-scraping-to-crea...

[2]: http://nabsa.net/current-members/

seanp2k2 3739 days ago

Why does your blog intentionally crash browsers that it thinks are Safari?

mryan 3739 days ago

You appear to have replied to the wrong comment. ScrapingHub is not the site that attempts to crash Safari - that is weboob.

fucking_tragedy 3739 days ago

What do you mean by intentionally?

kh_hk 3740 days ago

> For the record, neither service did; in fact, both tried to recruit me as a result of the visualizations. The difference is that I am not using the data to create a direct competitor that could cause them to lose business.

I do not understand that implication. How is providing bike share information creating a competitor? I can't run a bike sharing service.

minimaxir 3740 days ago

Less a competitor, more a non-canonical source of information that they cannot manage.

If your offshoot were to misrepresent data, for example, then you would become a liability, even if you weren't making money.

chrisweekly 3740 days ago

Random tangent: your comment about unmanaged, non-canonical, misrepresented data reminds me of Zillow, which publishes "facts" about real estate transactions with no mechanism for error correction. A few years ago they posted an erroneous sale of my house (a transaction that never took place), listing a sale price 20% lower than we'd paid for it. It directly harmed me when we later tried to sell the house: potential buyers cited Zillow's "estimates", which of course were artificially and drastically lower because of the phantom transaction. There was no avenue for recourse; angry tweets got a half-baked response from a junior social media person, but it was never resolved. I wonder how many others Zillow must have messed up.

kh_hk 3740 days ago

That's a fair point, and one easily addressed with a proper license. One example is the ETALAB open data license.

jdc 3739 days ago

Why do you think scraping needs justifying in the first place?

unsettledtck 3740 days ago

Out of curiosity, where are the boundaries of your gray area when it comes to scraping?

minimaxir 3740 days ago

If the service has an API with a fair rate limit (Foursquare does at 5000 requests/hour), I believe that is ok, since that implies their architecture is built for massive data requests. On the other hand, bypassing those rate limits with proxies is definitely bad.

If a website does not have an API (BuzzFeed), I take care to collect only the data that I need, and nothing that would damage the business (e.g., entire articles). Consequently, I sanitize the data of such things if I decide to release the dataset.
