Hacker News
by oceanplexian 4 days ago
A lot of this is over my head but why would you do compression when GPU time is the most expensive thing in the world right now?

KV can be trivially stored in RAM or even on a spinning disk and retrieved on the order of milliseconds. See LMCache for vLLM, for example. In fact it’s so easy it kinda shocks me when Claude Code will sit and recompute my entire KV cache on a new session after a couple of hours. I guess Anthropic infra is not as optimized as it would seem.

Think about the problem from first principles:

Storing a few GB per user at scale isn’t that hard and was solved years ago. Let’s say I have 20 chat sessions open and each session persists for a day or two. This seems negligible to me as a systems design problem.
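To put a rough number on "a few GB per user", here is a back-of-the-envelope sketch in Python. The model shape (32 layers, 8 KV heads, head dimension 128, fp16 values) and the 32k-token session are assumptions picked for illustration, not the configuration of any model mentioned in this thread.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Bytes needed to hold keys and values for n_tokens of context."""
    # Factor of 2: one tensor for keys and one for values, in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical model shape and session length (illustrative numbers only).
per_session = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=32_000)
print(f"one 32k-token session: {per_session / 1e9:.2f} GB")
print(f"20 such sessions:      {20 * per_session / 1e9:.2f} GB")
```

The size grows linearly with context length, so the same hypothetical model at 10x the tokens needs 10x the storage per session.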


I created a patch for llama.cpp to store the KV cache and the checkpoints on disk instead of deleting them... there is a bug in llama.cpp when you have more than one chat going at once... it causes the KV cache to be lost when you switch between chats... And I can tell you, using Qwen3.6 27B, after one day of use you can have 120-200 GB of chats on disk. And yes, it's way, way faster: even reading from a spinning disk is still faster than recomputing the whole thing...
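For anyone who wants the general shape of the idea, here is a toy Python sketch of a disk-backed store keyed by a hash of the token prefix. This is not the patch described above and not how llama.cpp stores its state. The `DiskKVStore` class, the `.kv` file suffix, and the opaque byte blob are all hypothetical stand-ins.

```python
import hashlib
import tempfile
from pathlib import Path


class DiskKVStore:
    """Toy store: one file per conversation prefix, keyed by a hash of its token ids."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, token_ids):
        # Identical prefixes hash to the same file, so a returning chat finds its state.
        digest = hashlib.sha256(",".join(map(str, token_ids)).encode()).hexdigest()
        return self.root / f"{digest}.kv"

    def save(self, token_ids, kv_blob):
        self._path(token_ids).write_bytes(kv_blob)

    def load(self, token_ids):
        """Return the stored blob, or None if this prefix was never saved."""
        path = self._path(token_ids)
        return path.read_bytes() if path.exists() else None


store = DiskKVStore(tempfile.mkdtemp())
store.save([1, 2, 3], b"serialized-kv-state")  # placeholder for real serialized tensors
print(store.load([1, 2, 3]))  # hit: the saved blob comes back
print(store.load([9, 9, 9]))  # miss: this prefix would have to be recomputed
```

A real implementation would also need eviction and a size budget, which is where the 120-200 GB per day figure starts to matter.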

I guess for a model with 300B parameters or more and a couple million users, with the price of storage increasing as part of RAMageddon, this is also not viable...

Qwen 27B's KV cache maxes out at about 16GB at full context. A nice thing about DeepSeek V4, especially Flash, is that its KV cache stays tiny even at 1M tokens! That in turn opens up wide batching on common consumer platforms.
DeepSeek V4 Flash is 160GB while Qwen 27B is about 27GB. You can't even run DS Flash on consumer platforms, let alone batch it.
These are the sizes of model weights, not the KV cache. The former are a sparse (for MoE models) read workload that can be streamed from SSD.
You can't batch MoE
You need wider batches to get effective reuse of experts in any given layer, but you absolutely can. DeepSeek V4 has tiny KV caches that make this quite feasible. When targeting consumer platforms that only have a limited amount of compute headroom to begin with, the approach is quite reasonable.
While prefill is bottlenecked by GPU compute time, decode might be bottlenecked by GPU memory bandwidth, as you basically need to go through the entire KV cache for each new token. So compression can make it faster: you will use more GPU compute but less memory bandwidth for the attention calculation.
Host to device bandwidth (RAM to VRAM) is 128 GB/s for PCIe Gen 6 (x16). VRAM to GPU bandwidth is 1.8 TB/s for GDDR7 (5090), and 8 TB/s for HBM3e (B200). So it can be faster to recompute than to offload the KV cache.
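A quick Python sketch of what those bandwidth figures mean for moving a cache around. The figures are the ones quoted in the comment above, read as bytes per second. The 4 GB cache size is an assumption for illustration.

```python
GB = 1e9

# Bandwidths from the comment above, in bytes per second.
BANDWIDTH = {
    "PCIe Gen 6 x16 (host to device)": 128 * GB,
    "GDDR7 (RTX 5090)": 1.8e12,
    "HBM3e (B200)": 8e12,
}


def transfer_seconds(n_bytes, bytes_per_second):
    """Ideal time to move n_bytes over a link, ignoring latency and protocol overhead."""
    return n_bytes / bytes_per_second


kv_bytes = 4 * GB  # hypothetical KV cache size for one session
for name, bw in BANDWIDTH.items():
    print(f"{name}: {transfer_seconds(kv_bytes, bw) * 1e3:.2f} ms")
```

This only gives the transfer side. Whether recomputing actually wins depends on prefill throughput for the model and context length in question, which nobody in this thread has put a number on.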
> a few GB per user at scale

While this might seem to be true for casual users, I recall that one of the reasons for Anthropic's recent change to retain the KV cache for only an hour or so was that many users just have one massive ongoing session that they continue with multiple unrelated queries (as one would in a single-thread "group chat"). And this is hard to distinguish from someone who wants that context so that their seemingly unrelated query picks up tone etc.

So in practice, there are many casual users typing their Google-esque searches against a 100k+ token context window, and it's at that point that things balloon into 300GB+ KV caches to maintain.

I wouldn't be surprised if we see new UXs around subsidized plans starting to encourage resetting the context window more often.

300GB of context for a single session is huge though. Modern local models max out at a whole lot less than that.
Because during inference of a single token you need KV state proportional to the context length to avoid quadratic recomputation. So compressing the KV cache lets you handle longer contexts in the same amount of VRAM.
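The "quadratic recomputation" part can be made concrete with a small counting sketch in Python. It counts per-token key/value projections over a whole generation, with and without a cache. It is a simplified model that ignores the attention computation itself, and the function names are made up for illustration.

```python
def kv_projections_without_cache(n_tokens):
    # At step t, keys/values for all t tokens seen so far must be projected again.
    return sum(range(1, n_tokens + 1))


def kv_projections_with_cache(n_tokens):
    # At step t, only the newest token is projected; older K/V come from the cache.
    return n_tokens


for n in (1_000, 10_000, 100_000):
    print(n, kv_projections_without_cache(n), kv_projections_with_cache(n))
```

The cache turns n(n+1)/2 projections into n, at the price of holding state that grows linearly with context, which is exactly the memory that compression is trying to shrink.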