Hacker News
by Tactic 2331 days ago
The issue here for some, if not many, is a matter of scale. It is one thing if an end-user, whom I am trying to service, comes to my site and gets my publicly available data. Maybe I monetize with ads, maybe not. It doesn't matter, that is the audience I am trying to service, regardless of size.

But when you scrape it my load goes up dramatically. A load I have to pay for.

It is analogous to the privacy debates going on, with one side saying "hey, don't track everywhere I go and tag me with facial recognition" and the other side saying "hey, you are in public and people can see you." The issue is not complete privacy, but one of scale. And of intent.

I believe society is soon going to have to come to grips with the scale of things and legislate what are acceptable scales of action, as it seems to be becoming a large issue in a growing number of areas.

3 comments

So you throttle your users. We have HTTP status codes for "too many requests" and all scraper software comes with a delay setting by default. Everybody who does scraping is supposed to know that it's rude to blast a thousand requests per second.
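For what it's worth, here is a rough sketch of what that throttling could look like on the server side: a per-client token bucket that answers 429 (Too Many Requests) with a Retry-After value once a client goes over its allowance. The class name, the `rate` and `capacity` parameters, and the injectable `clock` are all made up for illustration; this is not how any particular site or framework does it.

```python
import math
import time


class TokenBucket:
    """Hypothetical per-client limiter: `rate` requests/second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.buckets = {}   # client id -> (tokens remaining, time of last update)

    def check(self, client):
        """Return (http_status, retry_after_seconds) for one request from `client`."""
        now = self.clock()
        tokens, last = self.buckets.get(client, (self.capacity, now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[client] = (tokens - 1, now)
            return 200, 0
        self.buckets[client] = (tokens, now)
        # Seconds until one whole token is back, rounded up for a Retry-After header.
        return 429, math.ceil((1 - tokens) / self.rate)
```

A well-behaved scraper would read the Retry-After value and sleep at least that long before trying again. Keying the bucket by IP address, API key, or something else is a separate decision this sketch leaves open.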
This ruling has left open a big question of how much you need to spend to support scrapers and where the line between scraping and a DoS attack lies, and that's going to be a weird line. If my site is producing a big report off of data that changes quarterly, then re-downloading that report every 20 minutes is possibly excessive and might wander into the realm of an attack, while the same frequency against Twitter seems a lot more reasonable, maybe even a bit on the slow side.
Provide an API for public data to reduce the costs associated with rendering a full blown page, and deliver just the information needed.
Entirely feasible. Also reasonable for you to pay me for the service, as it is taking my development effort to meet your business model. The advantage to you is that you get a defined interface that I won't block.
I guess you missed the comment I was replying to: it may cost you more money, in bandwidth and per page resources, to not provide an API than it does for you to provide one.

So no, I won’t pay you for the privilege of you saving money.

No, I'll happily just scrape your site instead. But if you'd rather not have that happen, provide an API.
Who pays for that API and the bandwidth? What’s in it for the data provider? On LinkedIn, viewing the data now shows ads or at least prompts the viewer to join the network. With scraping and free API access, how exactly does LinkedIn benefit from their work of hosting the data?
My guess is hiQ (and others) would happily pay for an API over the data they're scraping right now.
Unless the costs exceed their current operational costs. Don't forget the time spent redeveloping on the new API, which includes validating everything is there, testing and cleaning up and removing the old (working) code.
Why buy the cow when you get the milk for free?
This isn't a great analogy here - getting the data delivered via API is simply more useful than having to re-assemble that data out of fragments parsed off of different web calls.

Could I suggest:

"Why buy the cheese when you get the milk for free?"

Look, if you build a product that relies on providing free information to the public, then you don't get to select a segment of that public and charge them for it. You can't hang a billboard on a highway, but then get upset when some people look at it the wrong way.

Now, if you want to have a walled garden and charge for entry to some and let others in free then that is fine.

The contention was not about load, it was about using the data.