by reaperman
805 days ago
There’s currently only one situation where scraping is almost definitely “not legal”: the information you’re scraping requires a login, getting that login requires agreeing to a terms of service, and that terms of service forbids scraping. In that case you could have a bad day in civil court if the website you’re scraping decides to sue you.

If the data is publicly accessible without a login, then scraping is 99% safe with no legal issues, even if you ignore robots.txt. You might still end up in court if you found a way to correctly guess non-indexed URLs[0], but you’d probably prevail in the end (…probably).

The “purpose” of robots.txt is to let crawlers know what they can do without getting IP-banned by the operator of the website they’re scraping. Crawlers that ignore robots.txt and also act more like robots than humans will generally get an IP ban.

[0] https://www.troyhunt.com/enumerationis-enumerating-resources...
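For anyone who does want their crawler to respect robots.txt, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the example.com URLs are all made up for illustration; a real crawler would fetch the file from the target site. Nothing here enforces anything: robots.txt is advisory, and the check only tells you what the operator has asked for.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "example-bot"  # hypothetical user agent name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Paths outside the Disallow rule are allowed for this agent.
allowed_public = parser.can_fetch(AGENT, "https://example.com/articles/1")

# Paths under /private/ are disallowed for every agent ("*").
allowed_private = parser.can_fetch(AGENT, "https://example.com/private/data")

# Seconds the operator asks crawlers to wait between requests, if specified.
delay = parser.crawl_delay(AGENT)

print(allowed_public, allowed_private, delay)
```

Honoring the crawl delay (sleeping between requests) is also the easiest way to look less like the kind of robot that gets IP-banned.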
|
Also, OpenAI's entire business model relies on generous interpretations of various IP laws, so I suspect they already have a mature legal division to handle these sorts of potential issues.