

by reaperman 805 days ago

There’s currently only one situation where scraping is almost definitely “not legal”:

If the information you’re scraping requires a login, and getting that login requires agreeing to terms of service that forbid scraping, then you could have a bad day in civil court if the website you’re scraping decides to sue you.

If the data is publicly accessible without a login then scraping is 99% safe with no legal issues, even if you ignore robots.txt. You might still end up in court if you found a way to correctly guess non-indexed URLs[0] but you’d probably prevail in the end (…probably).

The “purpose” of robots.txt is to tell crawlers what they can do without getting IP-banned by the operator of the website they’re scraping. Crawlers that ignore robots.txt and act more like robots than humans will generally get an IP ban.
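For anyone who wants their crawler to stay on the right side of robots.txt, here is a rough sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `example-crawler` user agent, and the example.com URLs are all made up for illustration; a real crawler would fetch the file from the target site. This only covers the "avoid an IP ban" side of things and says nothing about legality.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# A real crawler would download this from https://<site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical user agent name for the crawler.
USER_AGENT = "example-crawler"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Paths outside /private/ are allowed for every user agent.
allowed_public = parser.can_fetch(USER_AGENT, "https://example.com/public/page")

# Paths under /private/ are disallowed.
allowed_private = parser.can_fetch(USER_AGENT, "https://example.com/private/data")

# Seconds the site asks crawlers to wait between requests (None if unset).
delay = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay)
```

Checking `can_fetch` before each request and sleeping for `crawl_delay` between requests is about the minimum needed to look like a polite crawler rather than one that ignores the file.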

0: https://www.troyhunt.com/enumerationis-enumerating-resources...

1 comment

ToucanLoucan 805 days ago

Also worth noting there's a long history of companies with deep pockets getting away with murder (sometimes literally) because litigation in a system that costs money to engage with inherently favors the wealthier party.

Also, OpenAI's entire business model relies on generous interpretations of various IP laws, so I suspect they already have a mature legal division to handle these sorts of potential issues.
