by xlayn 9 days ago
I created a patch for llama.cpp that stores the KV cache and the checkpoints on disk instead of deleting them... there is a bug in llama.cpp when you have more than one chat going at once, which causes the KV cache to be lost when you switch between chats... And I can tell you, using Qwen3.6 27B, after one day of use you can have 120-200GB of chats on disk. And yes, it's way, way faster: even reading it back from a spinning disk is still faster than recomputing the whole thing...
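A rough way to sanity-check the "disk beats recompute" claim is to compare the time to read a saved cache against the time to prefill the same prompt again. The sketch below is only back-of-the-envelope arithmetic: the function names are made up for illustration, and the cache size, disk throughput, and prefill speed are placeholder assumptions, not measurements from the patch or from llama.cpp.

```python
def load_seconds(cache_bytes: float, disk_bytes_per_s: float) -> float:
    """Time to read a saved KV cache back from disk (sequential read)."""
    return cache_bytes / disk_bytes_per_s


def recompute_seconds(n_tokens: int, prefill_tokens_per_s: float) -> float:
    """Time to rebuild the same KV cache by re-running prefill over the prompt."""
    return n_tokens / prefill_tokens_per_s


if __name__ == "__main__":
    # All numbers below are assumptions chosen for illustration only.
    cache_bytes = 4 * 1024**3      # assumed 4 GiB saved cache for one chat
    n_tokens = 64_000              # assumed prompt length behind that cache
    hdd_bytes_per_s = 150e6        # assumed spinning-disk sequential read rate
    prefill_tokens_per_s = 500.0   # assumed prefill speed on local hardware

    t_load = load_seconds(cache_bytes, hdd_bytes_per_s)
    t_recompute = recompute_seconds(n_tokens, prefill_tokens_per_s)
    print(f"load from disk: {t_load:.1f} s, recompute: {t_recompute:.1f} s")
```

Plug in your own disk and prefill numbers; the comparison flips only when prefill is very fast relative to the disk.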

I guess for a model with 300B parameters or more and a couple million users, with the price of storage increasing as part of RAMageddon, this is also not viable...


Qwen 27B's KV cache maxes out at 16GB at full context. A nice thing about DeepSeek V4, especially Flash, is that its KV cache stays tiny even at 1M tokens! That in turn opens up wide batching on common consumer platforms.
DeepSeek V4 Flash is 160GB while Qwen 27B is about 27GB. You can't even run DS Flash on consumer platforms, let alone batch it.
These are the sizes of model weights, not the KV cache. The former are a sparse (for MoE models) read workload that can be streamed from SSD.
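To make the weights-versus-KV-cache distinction concrete, here is a minimal sketch of the usual KV cache size estimate for standard multi-head or grouped-query attention: two tensors (keys and values) per layer, per KV head, per token. The layer count, head count, and head dimension below are hypothetical and are not the real Qwen or DeepSeek configurations; models that compress the cache (latent or sparse attention, for example) will come out far smaller than this formula suggests.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # 2 bytes per element assumes fp16/bf16 storage
) -> int:
    """KV cache size for plain MHA/GQA attention, for a single sequence."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    size = kv_cache_bytes(n_tokens=128_000, n_layers=48, n_kv_heads=8, head_dim=128)
    print(f"{size / 1024**3:.1f} GiB of KV cache for one 128k-token sequence")
```

Note that the parameter count never appears in the formula, which is why a model with much larger weights can still have a much smaller cache.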
You can't batch MoE
You need wider batches to get effective reuse of experts in any given layer, but you absolutely can. DeepSeek V4 has tiny KV caches that make this quite feasible. When targeting consumer platforms that only have a limited amount of compute headroom to begin with, the approach is quite reasonable.
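A quick way to do the math on expert reuse: if each token picks top_k of n_experts and routing is roughly uniform and independent across tokens (a simplifying assumption; real routers are skewed), the expected number of distinct experts touched in a layer grows more slowly than the batch, so each expert loaded serves more tokens. The expert counts below are hypothetical, not DeepSeek V4's actual configuration.

```python
def expected_active_experts(n_experts: int, top_k: int, batch: int) -> float:
    """Expected distinct experts hit in one layer by `batch` tokens.

    Assumes uniform, independent top-k routing per token (an idealization).
    """
    # Probability that one given expert is picked by none of the tokens.
    p_unused = (1 - top_k / n_experts) ** batch
    return n_experts * (1 - p_unused)


def tokens_per_expert_load(n_experts: int, top_k: int, batch: int) -> float:
    """Average number of token-expert evaluations served per expert loaded."""
    return batch * top_k / expected_active_experts(n_experts, top_k, batch)


if __name__ == "__main__":
    # Hypothetical MoE shape, for illustration only.
    for batch in (1, 8, 32, 128):
        active = expected_active_experts(64, 4, batch)
        reuse = tokens_per_expert_load(64, 4, batch)
        print(f"batch={batch:4d}  active experts={active:5.1f}  reuse={reuse:5.2f}")
```

At batch size 1 every loaded expert serves exactly one token; reuse only climbs once the batch is wide enough that tokens start colliding on the same experts.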
Sounds like you're talking out of your butt instead of doing the math.