by scarmig 238 days ago
The report says that different media organizations dropped their robots.txt for the duration of the research to give LLMs access.

I would expect this isn't the on-off switch they conceptualized, but I don't know enough about how different LLM providers handle news search and retrieval to say for sure.


Does it work like that though? How long does it take for AI bots to crawl sites and for that data to reach the model currently being used? Am I wrong in thinking that content from AI bot crawls takes a lot longer to become available to the public than content from a typical search engine crawl?
The bots could be crawlers gathering data to be used periodically as raw training data, or the requests could just come from a web search agent of some form, like ChatGPT finding the latest news stories on topic X. I don’t know if robots.txt can distinguish between the two types of bot request, or whether LLM providers even honor it for either type.
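On the robots.txt part: rules in the file are grouped by User-agent, so it can only tell the two kinds of request apart if a provider sends a different user-agent token for each. Here is a minimal sketch using Python's standard urllib.robotparser. The bot names and the URL are made-up placeholders, not any provider's actual tokens, and it says nothing about whether a given bot actually obeys the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. "ExampleTrainingBot" and "ExampleSearchBot" are
# placeholder names (assumptions), not real crawler tokens.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url="https://news.example/story/123"):
    """Return True if the rules above permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


for agent in ("ExampleTrainingBot", "ExampleSearchBot", "SomeOtherBot"):
    print(agent, allowed(agent))
```

So the distinction is possible in principle, but only per user-agent string, and compliance is voluntary on the bot's side.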
Wow, just reading the headline I had assumed they were giving it the news article as a document, then asking it to summarize the document given.