| HN Mirror



	by eric-burel 320 days ago
	The first sentence of the article is literally wrong, as it conflates LLMs with the search part of RAG (retrieval-augmented generation, where you combine a web search with an LLM). Blocking bots does cut you off from next-generation search, but only because it cuts you off from search altogether. So far, blocking LLM simply prevents you from being part of the training dataset, which is not the same thing. Please stop upvoting such bad content; it really makes Hacker News a terrible place for staying informed on LLMs.


skeledrew 320 days ago

> blocking LLM simply prevents you from being part of the training dataset

That's narrow. Perplexity and other LLM agent services do perform a regular web search to gain context before generating their output. How else would they have access to recent data when the underlying LLM's knowledge cutoff is usually at least a few weeks old?

eric-burel 320 days ago

They are normal scrapers, nothing specific to LLMs, as they are not yet used for training an LLM, unless I'm missing something about their architecture. So I don't get why they would be called LLM crawlers when they are search engine crawlers. At least they could be called RAG crawlers for better nuance. The article linked in the post's first sentence is more precise, as it deals with scrapers: https://techcrunch.com/2025/08/04/perplexity-accused-of-scra... Some people may be OK with search engines but not LLM training, so it's not the same deal.
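
To make the "search is fine, training is not" distinction concrete, here is a rough sketch of how a site could express it in robots.txt and check the result with Python's standard `urllib.robotparser`. The user-agent tokens `ExampleTrainingBot` and `ExampleSearchBot` are made up for illustration; real crawlers publish their own tokens, and nothing forces a crawler to honor the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the bot names below are placeholders,
# not the tokens of any real crawler.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/post"
    print("training bot allowed:", allowed("ExampleTrainingBot", url))
    print("search bot allowed:", allowed("ExampleSearchBot", url))
```

This only states a preference per crawler name. It cannot tell what a crawler later does with the pages it was allowed to fetch.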

skeledrew 320 days ago

Crawling is done to discover and index content for search results (to relieve dependence on Google, etc.). Scraping is done to get relevant content into the LLM's context window. And then the LLM generates the output. All the functions are there, so someone may emphasize just a subset to try to make their point (which can cause issues if relevant context is left out, whether accidentally, ignorantly or maliciously).
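
A toy sketch of those three stages, purely illustrative: the `Page` type, the in-memory index, and the prompt layout are all assumptions, and the final generation step is left out because it would just be a call to whatever model the service uses.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str


def crawl(pages):
    """Crawling/indexing: build a toy inverted index, word -> set of URLs."""
    index = {}
    for page in pages:
        for word in set(page.text.lower().split()):
            index.setdefault(word, set()).add(page.url)
    return index


def search(index, query):
    """Search: return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


def build_prompt(query, pages, urls):
    """Scraping: pull the matching page text into the model's context."""
    by_url = {p.url: p for p in pages}
    context = "\n".join(f"[{u}] {by_url[u].text}" for u in urls)
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    pages = [
        Page("https://example.com/a", "llm crawlers index content"),
        Page("https://example.com/b", "cooking recipes for beginners"),
    ]
    index = crawl(pages)
    urls = search(index, "crawlers")
    # The resulting prompt is what would be handed to the LLM for generation.
    print(build_prompt("what do crawlers do?", pages, urls))
```

Blocking at the crawl stage removes a page from both search and context; nothing in the later stages is specific to training.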

> RAG crawlers

Very few people know what "RAG" is, so it makes little sense to mention it to anyone other than a technical audience.

> not LLM training

There's an issue of trust, because once content is scraped, it can also be used to train future models. That's really what ought to be emphasized IMO.

eric-burel 318 days ago

Your answer is definitely clearer than the article. I get the point better now, thanks for the feedback.
