HN Mirror



	by scarmig 238 days ago
	The report says that different media organizations dropped their robots.txt for the duration of the research to give LLMs access. I would expect this isn't the on-off switch they conceptualized, but I don't know enough about how different LLM providers handle news search and retrieval to say for sure.


dylan604 238 days ago

Does it work like that though? How long does it take for AI bots to crawl sites and have the data added to the model currently being used? Am I wrong in thinking that it takes a lot longer for AI bot crawls to be available to the public than a typical search engine crawler?


rimeice 238 days ago

Bots could be crawlers gathering data to periodically be used as raw training data or the requests could just be from a web search agent of some form like ChatGPT finding latest news stories on topic X for example. I don’t know if robots.txt can distinguish between the two types of bot request or whether LLM providers even adhere to either.
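
On whether robots.txt can tell the two kinds of bot apart: rules are grouped per user-agent, so it can, as long as the provider uses a separate user-agent token for each purpose. A minimal sketch with Python's standard `urllib.robotparser`, assuming the tokens `GPTBot` (training crawler) and `OAI-SearchBot` (search) that OpenAI documents; the robots.txt content and the example.com URL are made up for illustration. Compliance is voluntary, so this only shows what a well-behaved bot would conclude.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the training crawler, allow the search bot.
# The user-agent tokens are assumptions based on OpenAI's published names.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/news/story"  # placeholder URL
print("GPTBot:", allowed("GPTBot", URL))
print("OAI-SearchBot:", allowed("OAI-SearchBot", URL))
```

A bot that sends neither token falls under the `*` group, so a site would need one group for every user-agent it wants to treat differently.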


jay_kyburz 238 days ago

Wow, just reading the headline, I had assumed they were giving it the news article as a document, then asking it to summarize the document given.
