Hacker News new | ask | show | jobs
by boroboro4 443 days ago
Offloading in this case doesn’t mean keeping the kv cache on the disk/in storage all the time, it means keeping it there when request isn’t in process of generation. While request being generated kv cache is indeed in vram.

As for MLA - Deepseek is, just like others, attend to all historical tokens. The only difference instead of having actual KV entries it has lower dimension KV entries, which are being projected into full blown KV entries on the fly during attention. It’s similar to GQA, just instead of just duplication KV entries by size of groups it applies linear transformation.

1 comments

Ah OK. So this is for resuming chat context cheaply. What I said is still correct - 3FS is not part of the inference flow & not relevant to the paper which is about optimizing the KV cache usage at runtime.