by brushfoot 805 days ago
Web scraping the public Internet is legal, at least in the U.S.

hiQ's public scraping of LinkedIn was ruled to be within their rights and not a violation of the CFAA. I imagine that's why LinkedIn has almost everything behind an auth wall now.

Scraping auth-walled data is different. When you sign up, you have to check "I agree to the terms," and the terms generally say, "You can't scrape us." So, you can't just make a million bot accounts that take an app's data (legally, anyway). Those EULAs are generally legally enforceable in the U.S.

Some sites have terms at the bottom that prohibit scraping, but my understanding is that those generally aren't enforceable if the user doesn't have to take any action to accept or acknowledge them.

3 comments

Most of these SaaS companies have a "firehose" that you can subscribe to if you are big enough (i.e., can handle the firehose). These are like RSS feeds on crack for their entire service.

- https://developer.twitter.com/en/docs/twitter-api/enterprise...

- https://developer.wordpress.com/docs/firehose/

> Scraping auth-walled data is different. When you sign up, you have to check "I agree to the terms," and the terms generally say, "You can't scrape us." So, you can't just make a million bot accounts that take an app's data (legally, anyway). Those EULAs are generally legally enforceable in the U.S.

Legally enforceable in what sense? That the scraped services generally reserve the right to terminate the authorizing account at will, or that allowing someone to scrape with your credentials (or scraping with someone else's) qualifies as a CFAA violation?

hiQ was found to be in violation of the User Agreement in the end.

Basically, it was a breach of contract.

Exactly, that was my point.

hiQ's public scraping was found to be legal. It was the logged-in scraping that was the problem.

The logged-in scraping was a breach of contract, as you said.

The former is fine; the latter is not.

What OpenAI is doing here is the former, which companies are perfectly within their rights to do.