| HN Mirror



	by kristianp 1038 days ago
	IIRC LLMs also use Common Crawl data for training. Are they also blocking Common Crawl? Another thing is that ChatGPT 4 can do live retrieval of websites in response to users' questions. I imagine a different crawler does that. Are they going to block that too?

1 comment

twelve40 1037 days ago

This. Unfortunately, there is Common Crawl, there is Bing, and there are a million other ways they could get the data or hide where it came from. Or they could just ignore robots.txt; it's not like they run a very honest or transparent operation there.
