| HN Mirror


by LinuxBender 1209 days ago

I think the best way to answer the question would be to test it out. Have ChatGPT learn something from a URL that is forbidden by robots.txt.

FWIW, Google does not respect robots.txt in the way people think it does. It will still crawl and index a resource but will not publicly display it. The same goes for archive.org; I've verified that numerous times. Let archive.org index something that has always been forbidden by robots.txt, and then after some time take the site down. Once robots.txt is no longer reachable, archive.org will start displaying content that was always forbidden per robots.txt. All bots follow the pirate code. A bot will do what a bot >can< do...

If a resource is meant to be less-than-public, it must be behind authentication that bots cannot bypass, even with the assistance of a human using an addon. Translation addons, or any addons that use the cloud, are an easy way to bypass authentication.
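
To illustrate why robots.txt is only advisory, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules and the `example.com` URLs are made up for illustration. The point is that the check only happens if the crawler's own code chooses to call it; nothing on the server side enforces the answer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler asks before fetching; an impolite one simply skips this step.
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/page.html")
allowed_public = parser.can_fetch("ExampleBot", "https://example.com/index.html")

print("private allowed:", allowed_private)
print("public allowed:", allowed_public)
```

Whether a given crawler runs a check like this, and what it does with content it has already fetched, is entirely up to the crawler.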

1 comment

JohnFen 1208 days ago

This is why I stopped relying on robots.txt a long time ago. I still use it, but I also have my server check the user agent for crawlers and return a 403 to them.
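
A rough sketch of what a user-agent check like that could look like, written as a bare WSGI app in Python. This is not the commenter's actual setup: the `CRAWLER_TOKENS` list, `is_crawler`, and `app` are all hypothetical names chosen for illustration, and a real server would more likely do this in its web server or proxy config.

```python
# Hypothetical substrings to look for; a real deployment would keep its own list.
CRAWLER_TOKENS = ("bot", "crawler", "spider")


def is_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a crawler-like token."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def app(environ, start_response):
    """Minimal WSGI app: 403 for crawler-like user agents, 200 otherwise."""
    user_agent = environ.get("HTTP_USER_AGENT", "")
    if is_crawler(user_agent):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]
```

The User-Agent header is supplied by the client, so this only stops crawlers that identify themselves honestly. Anything that lies about its user agent gets through, which is why truly non-public content still needs authentication.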
