Hacker News
by kristianp 1038 days ago
IIRC, LLMs also use Common Crawl data for training. Are they also blocking Common Crawl?

Another thing is that ChatGPT 4 can do live retrieval of websites in response to users' questions. I imagine a different crawler does that. Are they going to block that too?
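
For illustration, here is a rough sketch of what blocking all three would look like in robots.txt, checked with Python's standard `urllib.robotparser`. It assumes the user-agent tokens `GPTBot` (OpenAI's training crawler), `ChatGPT-User` (the live-retrieval fetcher), and `CCBot` (Common Crawl); the `example.com` URL and the `allowed_agents` helper are made up for the example. It only shows what a compliant crawler would conclude from the file, since nothing forces a crawler to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the user-agent tokens are assumptions for this sketch.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url="https://example.com/article"):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(ROBOTS_TXT, ["GPTBot", "ChatGPT-User", "CCBot", "Googlebot"])
for agent, allowed in result.items():
    print(agent, "allowed" if allowed else "blocked")
```

Each crawler needs its own entry, so blocking the training crawler alone would leave the Common Crawl and live-retrieval agents untouched.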

1 comment

This. Unfortunately, there is Common Crawl, there is Bing, and there are a million other ways they could get the data or hide where it came from. Or they could just ignore robots.txt; it's not like they run a very honest or transparent operation there.