Hacker News
by phillipcarter 841 days ago
FWIW the use case you're describing is very often achievable with RAG. Embedding models are deterministic, so while you're still limited by the often-nondeterministic nature of the LLM, in practice you can usually get the same answer for the same input. And it's substantially cheaper to do.
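To make the determinism point concrete, here's a toy sketch of the retrieval half of a RAG setup. Everything in it is an assumption for illustration: `embed` is a hashed bag-of-words stand-in for a real embedding model, and `retrieve`, `DIM`, and `CHUNKS` are made-up names. The only thing it shows is that the same query against the same chunks always yields the same retrieved context, so any remaining variance comes from the LLM.

```python
import hashlib
import math

DIM = 64  # toy embedding size (assumption, not from any real model)


def embed(text: str) -> list[float]:
    """Toy deterministic embedding: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        # Use hashlib rather than built-in hash(), which is salted per
        # process for strings and would break run-to-run determinism.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query (cosine similarity)."""
    q = embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, embed(chunk))), i)
        for i, chunk in enumerate(chunks)
    ]
    # Highest score first; ties broken by chunk index so ordering is stable.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [chunks[i] for _, i in scored[:k]]


# Hypothetical document chunks.
CHUNKS = [
    "refund policy refunds are issued within 30 days",
    "shipping takes five business days",
    "support is available by email",
]

first = retrieve("refund policy", CHUNKS)
second = retrieve("refund policy", CHUNKS)
print(first == second)  # same input, same retrieved context
```

A real embedding model replaces `embed`, but the property is the same: retrieval is a pure function of the query and the index.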

With 1M tokens, if snapshotting the LLM state is cheap, it would beat nearly all RAG setups out of the box, except the ones dealing with large datasets. 1M tokens is a lot of docs.
Yeah, but latency is still a factor here. Any follow-up question requires re-scanning the whole context, which often takes a long time. IIRC when Google showed their demos for this use case each request took over 1 minute for ~650k tokens.
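Back-of-the-envelope version of that, using only the numbers above. Assumptions: latency scales linearly with token count (optimistic, since it ignores fixed overhead and attention cost growing faster than linearly), "over 1 minute" is taken as exactly 60 s, and the 4,000-token RAG prompt size is a made-up figure for comparison.

```python
def prefill_seconds(tokens: int, tokens_per_second: float) -> float:
    """Estimated time to scan a context, assuming linear scaling."""
    return tokens / tokens_per_second


observed_tokens = 650_000
observed_seconds = 60.0  # "over 1 minute", so real latency was at least this

# Upper bound on throughput implied by the demo numbers.
throughput = observed_tokens / observed_seconds

full_context = prefill_seconds(1_000_000, throughput)
rag_context = prefill_seconds(4_000, throughput)  # hypothetical RAG prompt size

print(f"implied throughput: {throughput:,.0f} tokens/s")
print(f"1M-token context:   {full_context:.1f} s per request")
print(f"4k-token RAG prompt: {rag_context:.2f} s per request")
```

Under those assumptions a full 1M-token context comes out around 92 s per request, versus well under a second for the small retrieved prompt, and that cost is paid again on every follow-up unless the state is cached.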