Hacker News
by machinelearning 793 days ago
Both RAG and infinite contexts in their current states are hacks.

Both waste compute because the text has to be re-encoded on every query, and RAG also needs a lot of heuristics plus a separate embedding model.

Instead, it makes a lot more sense to pre-compute the KV (keys and values) for each document, then compute values for each query, only surfacing values when the attention score is high enough.
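
A minimal sketch of that idea, assuming a single attention head, plain Python lists as vectors, and a softmax over every stored key. The names `precompute_kv`, `surface_values`, `kv_store` and `threshold` are hypothetical and chosen only for illustration. Position information is left out entirely.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def precompute_kv(token_vecs, w_k, w_v):
    """Project a document's token vectors into (key, value) pairs.
    This runs once per document, and the result is what gets stored."""
    return [(matvec(w_k, t), matvec(w_v, t)) for t in token_vecs]

def surface_values(query, kv_store, threshold):
    """Score a query against every stored key and return only the
    (doc_id, weight, value) triples whose attention weight >= threshold."""
    entries = [(doc_id, k, v) for doc_id, kv in kv_store.items() for k, v in kv]
    if not entries:
        return []
    scale = math.sqrt(len(query))
    scores = [sum(a * b for a, b in zip(query, k)) / scale for _, k, _ in entries]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [
        (doc_id, e / total, v)
        for (doc_id, _, v), e in zip(entries, exps)
        if e / total >= threshold
    ]

# Toy usage: identity projections, two tiny "documents".
identity = [[1, 0], [0, 1]]
kv_store = {
    "a": precompute_kv([[1, 0], [0, 1]], identity, identity),
    "b": precompute_kv([[-1, 0]], identity, identity),
}
surfaced = surface_values([4, 0], kv_store, threshold=0.5)
```

The document text is never re-encoded at query time here. Only the query vector is new, and the threshold decides which stored values reach generation.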

The challenge here is to encode global position information in the surfaced values and to get them to work with generation. I suspect it can't be done out of the box, but it will work with training.

This approach has echoes of both infinite context length and RAG but is an intermediate method that can be parallelized and is more efficient than either one.

1 comments

uh yeah it works out of the box, this is how most RAG systems are designed, just look at pgvector for example.
Nope, that's not how most RAG systems work today. I looked at pgvector and couldn't find anything similar.

Do you have a link? Or maybe you misunderstood what I was talking about?

Sorry for the late response. I must be misunderstanding your comment. I read your comment as "RAG doesn't pre-compute KV for each document, which is inefficient". With RAG, you convert your text into vectors and then store them in a DB; this is the pre-compute. Then you just need to compute the vector of your query and search by vector similarity. So it seems to me that RAG doesn't suffer from the inefficiency you said it suffers from.
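
The retrieval flow described here can be sketched roughly as follows. This is a toy version: `embed` is a bag-of-words stand-in for a real embedding model, and the in-memory list stands in for a vector DB. All names are hypothetical, and none of this is pgvector's API.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(docs):
    """The pre-compute step: embed every document once and store the vector."""
    return [(doc, embed(doc)) for doc in docs]

def search(index, query, k=1):
    """Embed the query, then return the k most similar documents as text."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

index = build_index([
    "postgres stores vectors with pgvector",
    "attention uses keys and values",
])
hits = search(index, "keys and values in attention")
```

Note that `search` hands back document text. The stored vectors are used only for ranking.
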
No, you've only discussed the Retrieval part of RAG, not the generation part.

The current workflow is to use the embedding to retrieve documents, then dump the text corresponding to the embedding into the LLM context for generation.

Often, the embedding is from a different model from the LLM and it is not compatible with the generation part.

So yeah, RAG does not pre-compute the KV for each document.
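
To make the generation side concrete, here is a rough sketch of what happens after retrieval, assuming a simple prompt template and a whitespace split as a crude stand-in for a tokenizer. Both `build_prompt` and `tokens_to_encode` are hypothetical helpers, not part of any library.

```python
def build_prompt(question, retrieved_docs):
    """Paste the retrieved text into the context ahead of the question."""
    context = "\n\n".join(retrieved_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def tokens_to_encode(prompt):
    """Crude token count (whitespace split) for what the LLM must encode."""
    return len(prompt.split())

docs = ["a b c"]
first = build_prompt("why?", docs)
second = build_prompt("how?", docs)

# The same document text appears in both prompts, so the LLM encodes it
# again for each question; nothing from the retrieval embedding is reused.
cost = tokens_to_encode(first) + tokens_to_encode(second)
```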

I see what you're saying now, thanks for clarifying.