by oceanplexian
4 days ago
A lot of this is over my head, but why would you do compression when GPU time is the most expensive thing in the world right now? The KV cache can be trivially stored in RAM or even on a spinning disk and retrieved on the order of milliseconds. See LMCache for vLLM, for example. In fact, it’s so easy that it kinda shocks me when Claude Code sits and recomputes my entire KV cache on a new session after a couple of hours; I guess Anthropic's infra is not as optimized as it would seem. Think about the problem from first principles: storing a few GB per user at scale isn’t that hard and was solved years ago. Let’s say I have 20 chat sessions open and each session persists for a day or two; this seems negligible to me as a systems design problem.
I guess for a model with 300B parameters or more and a couple million users, with the price of storage increasing as part of RAMageddon, this is also not viable...
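
To put rough numbers on the "few GB per user" versus "couple million users" question, here is a minimal back-of-envelope sketch. The model shape (80 layers, 8 KV heads, head dimension 128, fp16 values) and the 10,000-token session length are hypothetical, chosen only for illustration and not taken from any real deployment. The 20 sessions per user and 2 million users are one reading of the figures in the thread. The formula is the usual one for an uncompressed KV cache: one key and one value vector per layer, per KV head, per token.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_value=2):
    """Uncompressed KV cache size for one session, in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes fp16/bf16 storage (an assumption).
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


def fleet_bytes(per_session_bytes, sessions_per_user, num_users):
    """Total storage if every session of every user is kept around."""
    return per_session_bytes * sessions_per_user * num_users


if __name__ == "__main__":
    # Hypothetical model shape and context length, for illustration only.
    per_session = kv_cache_bytes(
        num_layers=80, num_kv_heads=8, head_dim=128, num_tokens=10_000
    )
    # 20 sessions per user and 2 million users, loosely following the thread.
    total = fleet_bytes(per_session, sessions_per_user=20, num_users=2_000_000)

    print(f"per session: {per_session / 1e9:.2f} GB")
    print(f"fleet total: {total / 1e15:.1f} PB")
```

Under these assumptions a single session is indeed a few GB, but the fleet-wide total lands in the hundred-petabyte range, which is where the two comments diverge. Shorter retention, fewer persisted sessions, or a smaller KV footprint per token would all change the result considerably.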