Hacker News
by jbellis, 4 days ago
Because you need a KV cache proportional to context length during inference of each token, to avoid quadratic recomputation. So compressing the KV cache lets you handle longer contexts in the same amount of VRAM.
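To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a standard transformer that caches one key and one value vector per layer, per KV head, per token. The model shape (32 layers, 8 KV heads, head dim 128) and the 8 GiB budget are made up for illustration, and compression is modeled crudely as fewer bytes per cached element.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """Approximate KV cache size for one sequence.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2.0 corresponds to fp16/bf16; smaller values stand in
    for a compressed or quantized cache (an assumption of this sketch).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def max_context(budget_bytes, n_layers, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """Longest context whose KV cache fits in budget_bytes."""
    per_token = kv_cache_bytes(1, n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return int(budget_bytes // per_token)


# Hypothetical model shape and memory budget, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
budget = 8 * 2**30  # 8 GiB of VRAM left over for the cache

print(max_context(budget, **shape))                      # fp16 cache
print(max_context(budget, **shape, bytes_per_elem=0.5))  # 4x smaller cache
```

The cache grows linearly with sequence length, so under these assumptions shrinking it by 4x gives 4x the context in the same budget.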