Hacker News
by nerdjon 819 days ago
I am curious, do we have any evidence that AI is adhering to robots.txt and isn’t ignoring it since they are not technically crawling in the traditional sense?

Even if they are right now, it would be a quick switch for them to just ignore it.

3 comments

I have examples in my logs of GPTBot fetching only /robots.txt, and nothing from the same /24 block fetched anything else after that, so it seems at least that bot respects robots.txt.
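
For anyone who wants to run the same check on their own logs, here is a rough sketch. It assumes the Combined Log Format, IPv4 client addresses, and that the crawler identifies itself with "GPTBot" in its User-Agent; the function names are made up for illustration.

```python
import ipaddress
import re
from collections import defaultdict

# Assumed Combined Log Format:
# ip ident user [time] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def paths_by_block(lines, agent_token="GPTBot"):
    """Map each /24 block to the set of paths requested by the given agent."""
    seen = defaultdict(set)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or agent_token.lower() not in m["agent"].lower():
            continue  # unparseable line or a different client
        # strict=False lets us pass a host address and get its enclosing /24.
        block = ipaddress.ip_network(m["ip"] + "/24", strict=False)
        seen[str(block)].add(m["path"])
    return dict(seen)

def robots_only_blocks(lines, agent_token="GPTBot"):
    """Return the /24 blocks from which the agent fetched /robots.txt and nothing else."""
    blocks = paths_by_block(lines, agent_token)
    return sorted(b for b, paths in blocks.items() if paths == {"/robots.txt"})
```

Calling `robots_only_blocks(open("access.log"))` (the file name is a placeholder) lists the blocks that stopped after reading robots.txt. It only tells you about requests that announce themselves as GPTBot, since the User-Agent header is self-reported.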

Maybe your question is "how do we know that whatever system GPTBot feeds downstream didn't just get your content via something else that crawls your site?" I am not sure we have anything to defend against that, other than signalling via robots.txt that our content is not intended for AI use.
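
As a small illustration of that signalling, here is a sketch using Python's standard `urllib.robotparser` to show how a well-behaved crawler would read a rule set that opts GPTBot out. The rule set and the `allowed` helper are examples I made up, not anything taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Example rules (an assumption for illustration): block GPTBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the rules permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With these rules, `allowed("GPTBot", "https://example.com/post/1")` is `False`, while another agent name gets `True`. This is purely advisory: it only describes what a compliant crawler would do and enforces nothing.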

Internet Archive's crawler does not respect robots.txt because they want to archive everything, not just parts of the Web. But if you are actively violating robots.txt, your crawler will get a bad reputation and you will have an army of webmasters trying to block it by any means. You can see crawl requests in your server logs; that's how you know whether they are respecting it or not.
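
A minimal sketch of that log check, assuming the Combined Log Format and a crawler that identifies itself honestly in its User-Agent. The `violations` helper is hypothetical; it uses the standard `urllib.robotparser` to decide which paths the rules disallow.

```python
import re
from urllib.robotparser import RobotFileParser

# Assumed Combined Log Format:
# ip ident user [time] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def violations(log_lines, robots_txt, agent_token):
    """List (ip, path) pairs where the agent requested a path its rules disallow."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    hits = []
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m or agent_token.lower() not in m["agent"].lower():
            continue  # unparseable line or a different client
        path = m["path"]
        if path == "/robots.txt":
            continue  # fetching the rules file itself is always expected
        if not parser.can_fetch(agent_token, path):
            hits.append((m["ip"], path))
    return hits
```

An empty result suggests the named crawler stayed within the rules during the logged period. It says nothing about crawlers that spoof or omit their User-Agent, which is the harder case raised earlier in the thread.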

IMO, the best solution would be to license your content so crawlers pay a fee for crawling and using it.

Well TIL that IA does not respect robots.txt.

Does IA themselves block crawlers? It doesn't look like it according to their robots.txt, even going so far as to say "Please crawl our files."

What would stop an actor from maliciously complying with a robots.txt file by just going to the Internet Archive instead?

>Well TIL that IA does not respect robots.txt.

At least, that's what they say[0].

>What would stop an actor from maliciously complying with a robots.txt file by just going to the Internet Archive instead?

Nothing; as far as I understand, scraping the public web is legal, or at least that's what courts have been saying lately. Btw, it's mind-boggling to me that after 30 years of the commercial Internet and Web, we still don't have a definite answer on whether scraping public websites and public web content is legal or illegal.

[0] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

> Nothing; as far as I understand, scraping the public web is legal, or at least that's what courts have been saying lately. Btw, it's mind-boggling to me that after 30 years of the commercial Internet and Web, we still don't have a definite answer on whether scraping public websites and public web content is legal or illegal.

I was more thinking from a public perception side instead of legal, but legal would be a good question too.

Something like, "Yeah, I totally respected your robots.txt file. The only reason I have your data is because I crawled IA. See, they are the ones you should be mad at, not us."

This is about crawling for training data, by the look of things. Not sure if the ChatGPT browsing mode uses a different user-agent, but most of the entries in that list look like crawlers.
I had assumed this was related to sites like ChatGPT going out and searching in response to a specific request.

Regardless, my original question is still valid. The companies have already shown a lack of care about the data they train on. So if ethics have already gone out the window, what is to stop them from ignoring this file, if they are not already?