by nine_k
247 days ago
A great post, it starts with this:

TL;DR
• MSI’s first paper, REFRAG, is about a new way to do RAG.
• This slightly modified LLM converts most retrieved document chunks into compact, LLM-aligned chunk embeddings that the LLM can consume directly.
• A lightweight policy (trained with RL) decides which chunk embeddings should be expanded back into full tokens under a budget; the LLM runs normally on this mixed input.
• The net effect is far less KV cache and attention cost, much faster first-byte latency and higher throughput, while preserving perplexity and task accuracy in benchmarks.

I wish more long posts followed this model of a scientific paper.
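
To make the "mixed input" idea from the quoted TL;DR concrete, here is a rough Python sketch of how such an input could be laid out. It is not the paper's code. The `Chunk` class, its `score` field, and `build_mixed_input` are names made up for illustration, the greedy pick is only a stand-in for the RL-trained policy, and "one input position per compressed chunk" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    tokens: list[str]  # full token sequence of a retrieved chunk
    score: float       # hypothetical policy score: higher = more worth expanding


def build_mixed_input(chunks, token_budget):
    """Decide per chunk: keep full tokens, or use a single chunk embedding.

    Returns (layout, positions), where layout preserves the original chunk
    order and positions is the number of input slots the LLM would see.
    """
    # Greedy stand-in for the learned policy: visit chunks by descending score
    # and expand each one that still fits in the remaining token budget.
    order = sorted(range(len(chunks)), key=lambda i: chunks[i].score, reverse=True)
    expanded, spent = set(), 0
    for i in order:
        cost = len(chunks[i].tokens)
        if spent + cost <= token_budget:
            expanded.add(i)
            spent += cost

    layout = []
    for i, chunk in enumerate(chunks):
        if i in expanded:
            layout.append(("tokens", chunk.tokens))
        else:
            # Assumption: a compressed chunk occupies exactly one position.
            layout.append(("embedding", i))

    positions = sum(len(item) if kind == "tokens" else 1 for kind, item in layout)
    return layout, positions
```

With three chunks of 4, 3, and 5 tokens and a budget of 8, the two highest-scoring chunks that fit are expanded and the third collapses to one slot, so the model sees fewer positions than the 12 it would get from full text. Fewer positions is where the KV cache and attention savings would come from.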