Hacker News new | ask | show | jobs
by onehp 842 days ago
Retrieval augmented generation. In short you use an LLM to classify your documents (or chunks from them) up front. Then when you want to ask the LLM a question you pull the most relevant ones back to feed it as additional context.
2 comments

I dont get it. To my understanding it takes huge amounts of data to build any any form of RAG. Simply because it enlarges the statistical model you later prompt. If the model is not big enough how would you expect it to answer you in a non qualifying matter ? It simply can't.

So I don't really buy it and I have yet to see it work better than any rdbms search index.

Tell me I am wrong, I would like to see a local model based on my own docs being able to answer me quality answers based on quality prompts.

RAG doesn't require much data or involve any training, it is a fancy name for "automatically paste some relevant context into the prompt"

Basically if you have a database of three emails and ask when Biff wanted to meet for lunch, a RAG system would select the most relevant email based on any kind of search - embeddings are most fashionable, and create a prompt like

"""Given this document: <your email>, answer the question "When does Biff want to meet for lunch?"""

That's not how RAG works. What you're describing is something closer to prompt optimization.

Sibling comment from discordance has a more accurate description of RAG. There's a longer description from Nvidia here: https://blogs.nvidia.com/blog/what-is-retrieval-augmented-ge...

Right, you read something nebulous about how "the LLM combines the retrieved words and its own response to the query into a final answer it presents to the user", and you think there is some magic going on, and then you click one link deeper and read at https://ai.meta.com/blog/retrieval-augmented-generation-stre... :

> Given the prompt “When did the first mammal appear on Earth?” for instance, RAG might surface documents for “Mammal,” “History of Earth,” and “Evolution of Mammals.” These supporting documents are then concatenated as context with the original input and fed to the [...] model

Finding the relevant context to put in the prompt is a search problem, nearest neighbour search on embeddings is one basic way to do it but the singular focus on "vector databases" is a bit of hype phenomenon IMO - a real world product should factor a lot more than just pure textual content into the relevancy score. Or is your personal AI assistant going to treat emails from yesterday as equally relevant as emails from a year ago?

Legit explanation, that's how it works AFAIK.
RAG:

1. First you create embeddings from your documents

2. Store that in a vector db

3. Ask what the user wants and do a search in the vector db (cosine similarity etc)

4. Feed the relevant search results to your LLM and do the usual LLM stuff with the returned embeddings and chunks of the documents

Although RAG is often implemented via vector databases to find 'relevant' content, I'm not sure that's a necessary component. I've been doing what I call RAG by finding 'relevant' content for the current prompt context via a number of different algorithms that don't use vectors.

Would you define RAG only as 'prompt optimisation that involves embeddings'?

Sure thing, your RAG approach sounds intriguing, especially since you're sidestepping vector databases. But doesn't the input context length cap affect it? (chatgpt plus at 32K [0] or gpt4 via open ai at 128K [1]) Seems like those cases would be pretty rare though.

[0]: https://openai.com/chatgpt/pricing#:~:text=8K-,32K,-32K

[1]: https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turb...

Yes, context window is a limiting factor, but that's true however you identify the content to augment generation.
You're misunderstanding. Imagine your query is matched against chunks of text from the database, where the relevance of information is evaluated for the window each time it slides. Then collecting the n most relevant chunks, these are included in the prompt so then the llm can provide answers from source documents verbatim. This is useful for cases where precise and exact answers are needed. For example, searching the docs for some package for the right api to call. You don't mant a name that's close to right it has to be correct to the character.
Ahh ok I see. It's basically what MS CoPilot 365 does too, with its "Grounding" step.
Yes.