| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by alexeldeib 92 days ago
	KV cache is, well, a cache that can fill up and trigger eviction. You require enough space to execute at least 1 fwd pass of 1 request at your context length. KV cache hits reduce TTFT by avoiding prefill. You don’t get to skip decode. MoE is kinda related in terms of lower usage requirements vs a dense model of same total param size, but I think your mental model is a bit off.

1 comments

zozbot234 91 days ago

KV cache is also eminently swappable if you have fast storage, since it mostly sees small append-only writes per token - it's not rewritten continuously like the activations. (I believe it's even better if you use cached input tokens across requests, since that portion of KV cache can then be recycled and save a single ~KV-cache sized write per request.) Accessing swapped-out cache may be slow, but it's highly preferable to not having that cache amount at all and recomputing from scratch.

link