Hacker News
by samwillis 1258 days ago
robots.txt is not legally binding, and neither are the "terms" on a website when it comes to screen scraping or automated access.

A "robots.txt for AI" would be nothing other than a polite request that will be ignored by the vast majority of organisations.

Under the current understanding of the law and copyright, the only preventative measure is putting content behind a wall with an explicit user agreement to access it. Effectively, if it's readable by a human without having to actively agree to a license, it can be scraped and used for any purpose, as long as it's not reproduced verbatim.

What we need is a better understanding of copyright and data mining in law. We need test cases.

2 comments

Putting your content behind a login of some kind does mean you can bind people to terms though. I suspect what we'll start seeing happen is that places where artists/writers/etc congregate will start requiring authentication and agreement to some terms to even see things.
Exactly, and sadly that's what's coming: more walls between users and content.

Those annoying cookie banners are just the beginning; all websites will eventually have an explicit license wall to access them.

Even if it's not legally binding, it will be, and should be, embarrassing for a big corp to do things that they are specifically asked not to do.

For example, Google doesn't really surface pages hidden by robots.txt, even though it could.
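
To make the "polite request" point concrete, here is a minimal sketch of the check a well-behaved crawler runs before fetching a page, using Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the rules, and the URLs are made up for illustration; nothing here is any real crawler's implementation. The point is that compliance lives entirely in the crawler's own code, so a crawler that skips this check is not stopped by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one (made-up) AI crawler everywhere,
# and blocks everyone else only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    This is purely advisory: the crawler decides whether to call it
    and whether to honour the answer.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.com/art/1"),
        ("SomeOtherBot", "https://example.com/art/1"),
        ("SomeOtherBot", "https://example.com/private/notes"),
    ]:
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "disallowed"
        print(f"{agent} -> {url}: {verdict}")
```

A crawler that respects the file would call something like `is_allowed` before every request; one that doesn't simply never makes the call, which is why only reputation or a login wall with agreed terms adds any real pressure.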