Hacker News
by grishka 1999 days ago
> the risk of web scrapers getting their proprietary data

That's some weird logic, to me at least. That data is literally given away to everyone but some people or organizations can't have it? If you want to control access to it, maybe at least require people to register before they can see it? Is it even proprietary if it's public with no access control whatsoever?

This for-profit internet is just really such a parallel universe to me.

3 comments

This is a question the courts are working through with LinkedIn and hiQ https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l..., as well as the Van Buren case https://themarkup.org/news/2020/12/03/why-web-scraping-is-vi...

It’s a different world where there are no laws or prices or contracts really.

> This for-profit internet is just really such a parallel universe to me.

I know I have been a contrary commenter in this thread, but I hear you with this. What a monster we have built, and what always gets me is how trivial everything is. So much capital is flowing through these ephemeral software systems that, if gone tomorrow, would be ultimately inconsequential to mankind.

I mean it's ridiculous to think about it, but there's this giant, many-billion-dollar online marketing industry that I essentially don't exist for. If it's gone tomorrow, I would indeed not notice, but it'd be the end of the world for some.

> and what always gets me is how trivial everything is

Whenever I read about corporations and how they work, I always inevitably ask myself the question "where the hell does enough work to keep this many people busy even come from?" Everything is ridiculously overengineered to meet imaginary deadlines.

> That data is literally given away to everyone but some people or organizations can't have it?

It's often a question of quantity. LinkedIn probably doesn't care about you scraping a few profiles, but if you're harvesting every bit of their publicly available data, then they get a little scared that you're building something that's going to compete with them.

Same with Instagram, or Facebook, for example. Though in this case it's probably more of a user-privacy issue - at least that's what they say.

It's not really weird logic to me - seems to make sense.
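
To make the "question of quantity" point concrete, here's a rough sketch of how a site could let a handful of profile views through while cutting off bulk harvesting. This is purely my own illustration: the class name, the per-client key, and the limit/window numbers are all made up, and I have no idea what LinkedIn or Instagram actually run.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical limiter: at most `limit` requests per client per `window` seconds."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        # client id -> timestamps of that client's recent allowed requests
        self._hits = defaultdict(deque)

    def allow(self, client, now=None):
        # `now` can be passed in explicitly, which makes the behaviour easy to test.
        now = time.monotonic() if now is None else now
        hits = self._hits[client]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # too many recent requests: looks like bulk scraping
        hits.append(now)
        return True


# A casual visitor stays under the cap; a fourth request inside the window is refused.
limiter = SlidingWindowLimiter(limit=3, window=60)
decisions = [limiter.allow("some-client", now=t) for t in (0, 1, 2, 3)]
```

The data stays public for the occasional reader (and the search crawler), and only the volume is restricted. In practice the hard part is deciding what counts as one "client", which this sketch just assumes you already know.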

> If you want to control access to it, maybe at least require people to register

Most of the time they can't do this because they need the Google traffic. LinkedIn wants a result in the SERP for Bob Smith when you search for "Bob Smith" because that helps them get signups. Google won't list the page if that content is gated by a sign-in/register page.