by alexeldeib
92 days ago
The KV cache is, well, a cache: it can fill up and trigger eviction. You need enough space to run at least one forward pass of one request at your context length. KV cache hits reduce TTFT (time to first token) by skipping prefill for the cached prefix. You don’t get to skip decode. MoE is kinda related in terms of lower usage requirements vs a dense model of the same total param count, but I think your mental model is a bit off.
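
To put a rough number on "enough space for one request at your context length", here is a back-of-the-envelope sketch. It assumes a standard transformer that stores one K and one V tensor per layer; the function names and the model shape (32 layers, 8 KV heads, head dim 128, fp16) are made up for illustration and are not any particular model or serving framework's API.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Approximate KV cache size for ONE request at a given context length.

    The leading factor of 2 covers the separate K and V tensors per layer.
    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem


def max_full_context_requests(kv_budget_bytes, per_request_bytes):
    """How many full-context requests fit in the cache budget at once."""
    return kv_budget_bytes // per_request_bytes


# Hypothetical model shape, chosen only so the arithmetic is easy to follow.
per_request = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, context_len=8192)
print(per_request / 1024**3, "GiB per request")

# Hypothetical budget: 10 GiB of GPU memory left for the cache after weights.
budget = 10 * 1024**3
print(max_full_context_requests(budget, per_request), "full-context requests fit")
```

If that last number comes out as 0, you can't serve even one request at that context length. Anything above that is room for more concurrent requests or for keeping old prefixes around for cache hits.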