The best way to understand RAG is that it's a prompting hack where you increase the chance that a model will answer a question correctly by pasting a bunch of text that might help into the prompt along with the user's question.
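
To make the "pasting text into the prompt" idea concrete, here is a minimal sketch. The function name `build_rag_prompt` and the prompt wording are my own assumptions for illustration, not any particular library's API; how the passages get chosen is left out entirely.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Paste candidate passages into the prompt ahead of the user's question."""
    # Number each passage so the model (and a human reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {passage}" for i, passage in enumerate(passages, start=1)
    )
    # The prompt wording here is an arbitrary choice for illustration.
    return (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_rag_prompt(
    "When was the bridge opened?",
    ["The bridge opened to traffic in 1937.", "It is painted orange."],
)
```

The resulting string is what you would send to the model in place of the bare question.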

The art of implementing RAG is deciding what text should be pasted into the prompt in order to get the best possible results.

A popular way to implement RAG is using similarity search via vector search indexes against embeddings (which I explained at length here: https://simonwillison.net/2023/Oct/23/embeddings/). The idea is to find the content that is semantically most similar to the user's question (or the likely answer to their question) and include extracts from that in the prompt.
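
A rough sketch of that similarity-search step, under a big assumption: `toy_embed` below is a bag-of-words stand-in I made up so the example runs with no dependencies. It only matches on shared words and is not semantic. A real system would call an embedding model and usually a vector index instead. The ranking by cosine similarity is the part that carries over.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Hypothetical stand-in for a real embedding model: word-count vector.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse vectors stored as Counters.
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(question: str, documents: list[str], k: int = 2) -> list[str]:
    # Rank every document by similarity to the question and keep the best k.
    query_vec = toy_embed(question)
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vec, toy_embed(doc)),
        reverse=True,
    )
    return ranked[:k]


docs = [
    "The cat sat on the mat",
    "Python is a programming language",
    "Cats chase mice",
]
best = top_k_similar("what programming language is python", docs, k=1)
```

The extracts returned here are what would then get pasted into the prompt.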

But you don't actually need vector indexes or embeddings at all to implement RAG.

Another approach is to take the user's question, extract some search terms from it (often by asking an LLM to invent some searches relating to the question), run those searches against a regular full-text search engine and then paste results from those searches back into the prompt.
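
Here is a small sketch of that flow with no embeddings involved. Everything named here is an assumption for illustration: `extract_search_terms` uses a crude stopword filter where a real system might ask an LLM to invent the searches, and `full_text_search` is a naive keyword matcher standing in for a real full-text search engine.

```python
import re

# Small, arbitrary stopword list chosen for this example.
STOPWORDS = {
    "a", "an", "the", "is", "are", "what", "how", "do", "does",
    "i", "in", "of", "to", "for", "and", "or",
}


def extract_search_terms(question: str) -> list[str]:
    # Stand-in for asking an LLM to come up with searches for the question.
    words = re.findall(r"\w+", question.lower())
    return [w for w in words if w not in STOPWORDS]


def full_text_search(terms: list[str], documents: list[str], limit: int = 3) -> list[str]:
    # Stand-in for a real search engine: score by how many terms a document contains.
    scored = []
    for doc in documents:
        doc_words = set(re.findall(r"\w+", doc.lower()))
        hits = sum(1 for term in terms if term in doc_words)
        if hits:
            scored.append((hits, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:limit]]


def prompt_from_search(question: str, documents: list[str]) -> str:
    # Search first, then paste the results into the prompt with the question.
    results = full_text_search(extract_search_terms(question), documents)
    context = "\n".join(f"- {r}" for r in results)
    return f"Context:\n{context}\n\nQuestion: {question}"


docs = [
    "Hold the reset button on the router for ten seconds.",
    "Our office hours are nine to five.",
]
prompt = prompt_from_search("How do I reset the router?", docs)
```

Swapping the two stand-ins for an LLM call and a proper search backend leaves the overall shape unchanged.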

Bing, Perplexity, and Google Gemini are all examples of systems that use this trick.