| HN Mirror


by selcuka 272 days ago

No, they explicitly block Gemini as well:

    User-agent: Google-Extended
    Disallow: /

Google still crawls with the same user agent for Gemini, but Gemini's use of the content is controlled by a separate robots.txt entry (Google-Extended) [1]:

> Google-Extended is a standalone product token that web publishers can use to manage whether content Google crawls from their sites may be used for training future generations of Gemini models that power Gemini Apps and Vertex AI API for Gemini and for grounding (providing content from the Google Search index to the model at prompt time to improve factuality and relevancy) in Gemini Apps and Grounding with Google Search on Vertex AI.

[1] https://developers.google.com/search/docs/crawling-indexing/...


simonw 272 days ago

Honestly I feel like "training" is a bit of a distraction at this point. For a lot of types of content RAG-style search is much more important.

I imagine many of the orgs that are blocking "training" don't understand the difference between training and inference-time tool-based context extension (which really needs an agreed-upon name; it's hard to talk about right now).


selcuka 272 days ago

My understanding is that it also affects RAG ("grounding" in Google terminology):

> [...] and for grounding (providing content from the Google Search index to the model at prompt time to improve factuality and relevancy) in Gemini Apps and Grounding with Google Search on Vertex AI.

So they seem to be blocking both training and RAG while still allowing search engine indexing.
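
To make the "block Google-Extended, still allow indexing" setup concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. Only the `Google-Extended` rule comes from the thread; the catch-all `User-agent: *` group, the `example.com` URL, and the `is_allowed` helper are assumptions made up for illustration. Note also that Google-Extended is a robots.txt product token rather than a crawler that sends its own user agent, so this only models how the rules are written, not how Google evaluates them internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: only the Google-Extended group is taken from the
# thread; the catch-all group is an assumption added for illustration.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, token: str, url: str) -> bool:
    """Return True if `token` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    # Compare the Gemini-related token with the regular search crawler token.
    for token in ("Google-Extended", "Googlebot"):
        verdict = is_allowed(ROBOTS_TXT, token, "https://example.com/article")
        print(f"{token}: {'allowed' if verdict else 'blocked'}")
```

Under these assumed rules, the script should report Google-Extended as blocked and Googlebot as allowed, which matches the "no training or grounding, but still indexed" reading above. Python's parser uses its own matching logic, so treat it as a sanity check on the file, not as a statement of Google's behaviour.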
