Ask HN: Does ChatGPT respect Robots.txt? | HN Mirror

Ask HN: Does ChatGPT respect Robots.txt?

36 points by iKevinShah 1207 days ago

Hello HN,

I was looking for more information on whether ChatGPT (and similar LLMs) respects robots.txt directives.

I couldn't find any details about the crawlers used, or the rules they followed, while ChatGPT was being trained.

When I asked ChatGPT, it says "As an AI language model, I do not have the ability to crawl the web on my own. However, as a general principle, web crawlers should follow the rules specified in the website's" and when asked further about who created the dataset and the method they used, the response is "The dataset used to train me was created by OpenAI, the organization that developed and maintains my system. OpenAI's team of researchers and engineers collected the training data from a wide variety of sources, including books, articles, websites, and other publicly available text data."

So, no clear answer as far as I can tell. Building the dataset manually is obviously a near-impossible task (given the size of the team), so I assume some crawlers were used. If anyone knows or can shed light on it, that would be great.

8 comments

wildpeaks 1207 days ago

Let's not forget that ChatGPT can lie: just because it says something doesn't make it true.

It's more realistic to assume that any data a company is able to access will get gobbled up sooner or later because there is no real penalty for ignoring robots.txt or licenses at their scale: even if someone were to notice an infraction and had enough money to sue them for years, they could afford it and brush it off as the cost of doing business (and if it's not ChatGPT, then another model, the cat's out of the bag now).

A robots.txt gives as much protection as a "please do not hack me" text file against ransomware.

paulcole 1207 days ago

In a way it’s even worse. Listing the stuff you don’t want crawled might be more like a text file with a list of vulnerabilities that hackers shouldn’t use against you.

iKevinShah 1206 days ago

Considering that bit about robots.txt is true (and I feel it is), what can one do? Are there no regulations (implemented or in the planning stage) from any of the bodies that decide the standards?

At some point, content owners should, technically, have some way to limit or control who accesses their content.

tlb 1207 days ago

It's trained on data crawled from the web, using CommonCrawl among other sources. See https://github.com/openai/following-instructions-human-feedb.... CommonCrawl certainly respects robots.txt, and I assume the other sources do too.

LinuxBender 1207 days ago

I think the best way to answer the question would be to test it out. Have ChatGPT learn something from a URL that is forbidden by robots.txt.
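A minimal sketch of how such a test could be set up, assuming you control a site: publish a page containing a unique phrase at a path disallowed in robots.txt, then later ask the model about that phrase. The path, the `canary-` prefix, and the function names `make_canary_page` and `is_allowed` are hypothetical. `urllib.robotparser` is Python's standard-library parser, and it only tells you what a compliant crawler would do, not what any particular crawler actually did.

```python
import secrets
from urllib.robotparser import RobotFileParser


def make_canary_page(path="/private/canary.html"):
    """Build a unique phrase, a robots.txt that forbids the page, and the page itself."""
    # A random token that should not appear anywhere else on the web.
    token = "canary-" + secrets.token_hex(8)
    robots_txt = f"User-agent: *\nDisallow: {path}\n"
    html = f"<html><body><p>The secret phrase is {token}.</p></body></html>"
    return token, robots_txt, html


def is_allowed(robots_txt, user_agent, url):
    """Return True if a crawler that honours robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


token, robots_txt, html = make_canary_page()

# The canary page is off limits to compliant crawlers; the rest of the site is not.
print(is_allowed(robots_txt, "AnyBot", "https://example.com/private/canary.html"))  # False
print(is_allowed(robots_txt, "AnyBot", "https://example.com/index.html"))           # True
```

If the model later reproduces the token, something fetched the page despite the rule. The absence of the token proves much less, since the page may simply never have been visited or selected for training.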

FWIW Google does not respect robots.txt in the way people think it does. They will still crawl and index a resource but will not publicly display it. Same for archive.org. I've verified that numerous times. Let archive.org index a thing that has always been forbidden by robots.txt and then after some time take the site down. Once robots.txt is no longer reachable, archive.org will start displaying content that was always forbidden per robots.txt. All bots follow the pirate code. A bot will do what a bot >can< do...

If a resource is meant to be less-than-public, it must be behind authentication that bots cannot bypass even with the assistance of a human using an addon. Translation addons or any addons using the cloud are an easy way to bypass authentication.

JohnFen 1206 days ago

This is why I stopped relying on robots.txt a long time ago. I still use it, but I also have my server check the user agent for crawlers and return a 403 to them.
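One way the user-agent check described above might look, sketched as a plain WSGI middleware. The marker list `CRAWLER_MARKERS` and the names `block_crawlers` and `site` are illustrative assumptions, not the actual setup, and a crawler that sends a browser-like user agent would pass straight through.

```python
# Hypothetical substrings that suggest an automated client; tune for your own traffic.
CRAWLER_MARKERS = ("bot", "crawler", "spider")


def block_crawlers(app):
    """Wrap a WSGI app so that requests from apparent crawlers get a 403."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(marker in user_agent for marker in CRAWLER_MARKERS):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Anything else is passed through to the real application.
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Stand-in for the real site."""
    body = b"Hello\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


application = block_crawlers(site)
```

Any WSGI server can serve `application`. The same check could equally live in the web server's own configuration.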

amluto 1207 days ago

Why would one expect ChatGPT to know the answer to this question? ChatGPT “knows” what it was trained on. The training data is unlikely to include a definitive answer to your question. And ChatGPT is not currently smart enough to do the kind of analysis that would determine the answer, nor is it likely to be able to do the kind of queries that would be needed to figure it out.

gtirloni 1206 days ago

The training data could include internal docs that describe whether or not it ignores the robots.txt file.

amluto 1206 days ago

If I were involved at OpenAI, I would not include the internal wiki, Slack archives, Dropbox folders, etc. in the training data. While it would be highly entertaining, it would not be a good idea.

iKevinShah 1206 days ago

I agree that private data (in a best-case scenario) should not and would not be included in the training, but some parts of the internal documents would be public (let's say on the public website). I would expect ChatGPT to know at least those.

kbelder 1206 days ago

>"I asked ChatGTP..."

"On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question."

-Charles Babbage

notahacker 1207 days ago

I think the more important question is whether the notion of "indexing" for search purposes accurately describes what a corpus of data that never refers back to its source material does.

Ideally, you'd have an updated version of robots.txt which specified whether it allowed agents to use content in training LLMs or not, which different content publishers would set differently for different reasons.
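robots.txt as it exists today has no field for permitted uses, so the closest approximation is a per-crawler rule. The sketch below assumes a hypothetical training crawler that identifies itself as `ExampleTrainingBot` and a hypothetical search crawler `ExampleSearchBot`, and shows how Python's standard-library parser would treat the two differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block one named training crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/articles/1"
training_allowed = parser.can_fetch("ExampleTrainingBot", url)
search_allowed = parser.can_fetch("ExampleSearchBot", url)

print(training_allowed)  # False
print(search_allowed)    # True
```

This only expresses who may fetch, not what the fetched content may be used for, and it depends on the crawler announcing itself honestly.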

gorbypark 1207 days ago

I don't know if they did their own crawling as well, but at least part of the training set for GPT-3 was Common Crawl data. You could look up if Common Crawl respects Robots.txt or not.

iKevinShah 1207 days ago

To give more context - I asked further:

Me: If you did not crawl - where did you learn about the data?

C: As an AI language model, my learning is based on a pre-existing dataset of text. This dataset is created by training me on a large corpus of text, which can include anything from books and articles to websites and social media posts. This training data is processed and used to develop a model that can understand natural language and generate responses to questions and queries.

It is important to note that my training data is not limited to any one source, and it may include information from a variety of sources and contexts. This means that I have been exposed to a wide range of information, including information from websites that I may not have crawled myself. However, the information that I provide is based solely on my training data and my ability to understand and interpret natural language.

Me: The said dataset: who created it, and were crawlers used for it?

C: The dataset used to train me was created by OpenAI, the organization that developed and maintains my system. OpenAI's team of researchers and engineers collected the training data from a wide variety of sources, including books, articles, websites, and other publicly available text data.

In the process of collecting this data, web crawlers may have been used to collect text from websites. However, it's important to note that OpenAI and other responsible organizations adhere to ethical and legal standards when collecting data, and they respect the terms of service and other policies of websites that they crawl. This means that they take steps to ensure that they are not overloading servers with excessive requests or crawling sensitive information that should not be made public. Additionally, they make efforts to anonymize any data that is collected so that individual users or websites cannot be identified.

Me: It is good to know that you respect the service and policies but is it documented anywhere publicly?

C: Yes, OpenAI has publicly documented its policies and practices related to data collection, as well as its commitment to ethical and responsible AI development. These policies and practices are detailed on OpenAI's website, and the organization is transparent about its approach to data collection and AI development.

For example, OpenAI has published a set of ethical guidelines for AI development that emphasize transparency, fairness, and safety. The organization also participates in industry-wide initiatives focused on responsible AI development, such as the Partnership on AI and the Responsible AI License.

In addition, OpenAI has developed a number of tools and technologies designed to promote transparency and accountability in AI systems, including tools for interpreting and visualizing machine learning models, and technologies for measuring the potential impact of AI systems on society.

Overall, OpenAI is committed to responsible and ethical AI development, and the organization is transparent about its policies and practices related to data collection, AI development, and the responsible use of AI technologies.

Me: OK, can you share the link for that?

C: Certainly! Here is the link to OpenAI's website where you can find more information about the organization's policies and practices related to AI development, data collection, and responsible use of AI technologies:

https://openai.com/about/ethics-safety/

This page provides an overview of OpenAI's approach to responsible AI development and links to more detailed information about the organization's policies and practices, as well as its tools and technologies for promoting transparency and accountability in AI systems.

Me: That link returns a 404 error.

C: I'm sorry about that! It looks like the link may have changed. Here's the updated link to OpenAI's website where you can find information about the organization's ethics and safety policies:

https://openai.com/about/mission-and-values/ethics-and-safet...

This page provides an overview of OpenAI's approach to ethical and responsible AI development, including its commitment to transparency, safety, and avoiding potential negative impacts of AI technologies. It also includes links to more detailed information about the organization's policies and practices.

Me: That is a 404 as well.

C: I apologize for the confusion! OpenAI has recently made some changes to its website, and some of the links may have changed. Here's the updated link to the page that provides an overview of OpenAI's mission, values, and ethics:

https://openai.com/about/mission/

From this page, you can access more information about the organization's values and ethics, including its commitment to transparency and responsible AI development. Additionally, you can explore the rest of the site to learn more about OpenAI's research, technology, and initiatives.

---

That is a 404 too, so I decided to ask the experts here :)

m3047 1207 days ago

ChatGPT declaring someone dead and making up a legitimate-looking URL to a nonexistent obituary: https://www.theregister.com/2023/03/02/chatgpt_considered_ha...

ChatGPT also misrepresented its own terms and conditions in a deceptive fashion to a professional colleague who then relied on that advice in a LinkedIn post. Plenty of blame to go around on that one, but who's culpable? What's the remedy? How can they get that wrong?

wmedrano 1206 days ago

It seems like you are trying to use ChatGPT as a smart entity, but it is basically advanced autocomplete. The links it provided are just what is likely to exist, not what actually does exist.