Hello HN, I was looking for more info on whether ChatGPT (and similar LLMs) respect robots.txt directives. I couldn't find any details about the crawlers used, or the rules they followed, when ChatGPT's training data was collected. When I asked ChatGPT, it said: "As an AI language model, I do not have the ability to crawl the web on my own. However, as a general principle, web crawlers should follow the rules specified in the website's" and when I asked further about who created the dataset and what method they used, the response was: "The dataset used to train me was created by OpenAI, the organization that developed and maintains my system. OpenAI's team of researchers and engineers collected the training data from a wide variety of sources, including books, articles, websites, and other publicly available text data." So, no clear answer as far as I can tell. Building the dataset manually is obviously a near-impossible task (given the size of the team), so some crawlers must have been used (my assumption). If anyone knows or can shed light on this, that would be great.
It's more realistic to assume that any data a company is able to access will get gobbled up sooner or later, because at their scale there is no real penalty for ignoring robots.txt or licenses: even if someone noticed an infraction and had enough money to sue for years, the company can afford the fight and brush it off as a cost of doing business (and if it's not ChatGPT, it will be another model; the cat's out of the bag now).
A robots.txt gives as much protection as a "please do not hack me" text file does against ransomware.
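
To make the "voluntary" point concrete: honoring robots.txt is a check the crawler itself has to choose to run before each request. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` name, the `is_allowed` helper, and the example.com URLs are all made up for illustration, and none of this says anything about how OpenAI's crawlers actually behave.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from https://<site>/robots.txt before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/index.html",
                "https://example.com/private/data.html"):
        # A polite crawler skips the URL when this is False.
        # Nothing on the server side stops an impolite one.
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

The check lives entirely in the client. A crawler that never calls something like `is_allowed` fetches `/private/` just as easily as any other path.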