by eric-burel 320 days ago
The first sentence of the article is literally wrong: it conflates LLMs with the search part of a RAG system (retrieval-augmented generation, where you combine a web search with an LLM). Blocking bots cuts you off from next-generation search because it cuts you off from search altogether. So far, blocking LLM simply prevents you from being part of the training dataset, which is not the same thing. Please stop upvoting such bad content; it really makes Hacker News a terrible place for staying informed on LLMs.

> blocking LLM simply prevents you from being part of the training dataset

That's narrow. Perplexity and other LLM agent services do perform a regular web search to gain context before generating their output. How else would they have access to recent data when the underlying LLM's knowledge cutoff is usually at least a few weeks old?

They are normal scrapers, nothing specific to LLMs, as they are not yet used for training an LLM, unless I'm missing something about their architecture. So I don't get why they would be called LLM crawlers when they are search engine crawlers. At least they could be called RAG crawlers for better nuance. The article linked in the post's first sentence is more precise, as it deals with scrapers: https://techcrunch.com/2025/08/04/perplexity-accused-of-scra... Some people may be OK with search engines but not LLM training, so it's not the same deal.
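To make the "OK with search engines but not LLM training" distinction concrete, here is a minimal sketch of a robots.txt policy that blocks one crawler while allowing the rest, checked with Python's standard `urllib.robotparser`. The user-agent names `ExampleTrainingBot` and `ExampleSearchBot` are hypothetical placeholders, not real crawler names, and this only works for bots that actually honor robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block a (made-up) training crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/post"
# Hypothetical user-agent names, for illustration only.
training_allowed = parser.can_fetch("ExampleTrainingBot", url)
search_allowed = parser.can_fetch("ExampleSearchBot", url)
print(training_allowed, search_allowed)
```

This expresses a preference per user agent; it does nothing against a scraper that ignores the file or hides its identity.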
Crawling is done to discover and index content for search results (to relieve dependence on Google, etc.). Scraping is done to get relevant content into the LLM's context window. And then the LLM generates the output. All the functions are there, so someone may emphasize just a subset to try to make their point (which can cause issues if relevant context is left out, whether accidentally, ignorantly or maliciously).
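A toy sketch of those three stages, to show how they fit together. Everything here is an assumption made for illustration: the in-memory `PAGES` dict stands in for real HTTP fetches, the keyword index stands in for a real search index, and the final LLM call is left out, so the sketch stops at the prompt that would be sent.

```python
# Hypothetical in-memory "web": URL -> page text (stands in for real HTTP fetches).
PAGES = {
    "https://example.com/a": "Perplexity is accused of scraping sites that block crawlers.",
    "https://example.com/b": "A recipe for sourdough bread with a long fermentation.",
}

def crawl(pages):
    """Crawling: discover pages ahead of time and build a word -> URLs index."""
    index = {}
    for url, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word.strip(".,"), set()).add(url)
    return index

def search(index, query, k=1):
    """Search: rank URLs by how many query words they contain."""
    scores = {}
    for word in query.lower().split():
        for url in index.get(word.strip(".,?"), ()):
            scores[url] = scores.get(url, 0) + 1
    return sorted(scores, key=lambda u: (-scores[u], u))[:k]

def scrape(pages, url):
    """Scraping: fetch the content of one search result at query time."""
    return pages[url]

def build_prompt(query, contexts):
    """Put the scraped text into the context window; an LLM call would follow."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{joined}\n\nQuestion: {query}"

index = crawl(PAGES)
query = "who is accused of scraping"
hits = search(index, query)
prompt = build_prompt(query, [scrape(PAGES, u) for u in hits])
print(prompt)
```

Note that nothing in this flow trains a model; whether the scraped text is also kept for later training is a separate decision made outside this pipeline.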

> RAG crawlers

Very few people know what "RAG" is, so it makes little sense to mention it to anyone other than a technical audience.

> not LLM training

There's an issue of trust, because once content is scraped, it can also be used to train future models. That's really what ought to be emphasized IMO.

Your answer is definitely clearer than the article, I get the point better, thanks for the feedback.