HN Mirror

by Tactic 2331 days ago

The issue here for some, if not many, is a matter of scale. It is one thing if an end-user, whom I am trying to service, comes to my site and gets my publicly available data. Maybe I monetize with ads, maybe not. It doesn't matter, that is the audience I am trying to service, regardless of size.

But when you scrape it my load goes up dramatically. A load I have to pay for.

It is analogous to the privacy debates going on, with one side saying "hey, don't track everywhere I go and tag me with facial recognition" and the other side saying "hey, you are in public and people can see you." The issue is not complete privacy, but one of scale. And of intent.

I believe society is soon going to have to come to grips with the scale of things and legislate what are acceptable scales of action, as it seems to be becoming a large issue in a growing number of areas.

3 comments

hannasanarion 2331 days ago

So you throttle your users. We have HTTP status codes for "too many requests" and all scraper software comes with a delay setting by default. Everybody who does scraping is supposed to know that it's rude to blast a thousand requests per second.
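A minimal sketch of what that throttling could look like on the server side, assuming a simple per-client sliding window. The class name `SlidingWindowLimiter`, the limits, and the way clients are identified are all illustrative assumptions, not anything from a specific framework. The only standard pieces are the 429 status code and the `Retry-After` header.

```python
import math
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per client (hypothetical helper)."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # client id -> timestamps of accepted requests

    def check(self, client_id):
        """Return (status, headers) for one incoming request from client_id."""
        now = self.clock()
        recent = self.hits[client_id]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            # Over the limit: answer 429 and say when the oldest hit expires.
            retry_after = max(1, math.ceil(self.window - (now - recent[0])))
            return 429, {"Retry-After": str(retry_after)}
        recent.append(now)
        return 200, {}
```

A well-behaved scraper that receives the 429 would wait at least the `Retry-After` number of seconds before trying again. How to identify a "client" (IP address, API key, user agent) is left open here, and it is the hard part in practice.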

munk-a 2331 days ago

This ruling has left open a big question of how much you need to spend to support scrapers and where the line between scraping and a DoS attack lies, and that's going to be a weird line. If my site is producing a big report off of data that changes quarterly, then re-downloading that report every 20 minutes is possibly excessive and might wander into the realm of an attack, while if we looked at the same frequency with Twitter it seems a lot more reasonable, maybe even a bit on the slow side.

orf 2331 days ago

Provide an API for public data to reduce the costs associated with rendering a full blown page, and deliver just the information needed.
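A toy sketch of the payload-size side of this argument, assuming made-up records and a deliberately small page template. `RECORDS`, `render_page`, and `render_api` are hypothetical names for illustration. It only compares bytes sent and says nothing about server-side rendering cost, which varies by site.

```python
import json
from html import escape

# Hypothetical records standing in for a site's public data.
RECORDS = [
    {"name": "Ada", "title": "Engineer"},
    {"name": "Grace", "title": "Admiral"},
]


def render_page(records):
    """Full HTML page: markup, navigation, and styling wrapped around the data."""
    rows = "".join(
        f"<tr><td>{escape(r['name'])}</td><td>{escape(r['title'])}</td></tr>"
        for r in records
    )
    return (
        "<!DOCTYPE html><html><head><title>Profiles</title>"
        "<style>table{border-collapse:collapse}td{padding:4px}</style></head>"
        "<body><nav><a href='/'>Home</a> <a href='/about'>About</a></nav>"
        f"<table>{rows}</table><footer>Example footer</footer></body></html>"
    ).encode("utf-8")


def render_api(records):
    """API response: just the data, serialized as compact JSON."""
    return json.dumps(records, separators=(",", ":")).encode("utf-8")


if __name__ == "__main__":
    print("page bytes:", len(render_page(RECORDS)))
    print("api bytes: ", len(render_api(RECORDS)))
```

The scraper also gets the data in a form it can parse directly with `json.loads`, instead of picking it out of table markup.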

Tactic 2331 days ago

Entirely feasible. Also reasonable for you to pay me for the service, as it is taking my development efforts to meet your business model. The advantage to you is you have a defined interface that I won't block.

orf 2331 days ago

I guess you missed the comment I was replying to: it may cost you more money, in bandwidth and per page resources, to not provide an API than it does for you to provide one.

So no, I won’t pay you for the privilege of you saving money.

oauea 2331 days ago

No, I'll happily just scrape your site instead. But if you'd rather not have that happen, provide an API.

briandear 2331 days ago

Who pays for that API and the bandwidth? What’s in it for the data provider? On LinkedIn, viewing the data now shows ads or at least prompts the viewer to join the network. With scraping and free API access, how exactly does LinkedIn benefit from their work of hosting the data?

0x445442 2331 days ago

My guess is hiQ (and others) would happily pay for an API over the data they're scraping right now.

gregmac 2331 days ago

Unless the costs exceed their current operational costs. Don't forget the time spent redeveloping on the new API, which includes validating that everything is there, testing, and cleaning up and removing the old (working) code.

munificent 2331 days ago

Why buy the cow when you get the milk for free?

munk-a 2331 days ago

This isn't a great analogy here - getting the data delivered via API is simply more useful than having to re-assemble that data out of fragments parsed off of different web calls.

Could I suggest:

"Why buy the cheese when you get the milk for free?"

coryfklein 2331 days ago

Look, if you build a product that relies on providing free information to the public, then you don't get to select a segment of that public and charge them for it. You can't hang a billboard on a highway, but then get upset when some people look at it the wrong way.

Now, if you want to have a walled garden and charge for entry to some and let others in free then that is fine.

dariosalvi78 2331 days ago

The contention was not about load, it was about using the data.
