Hacker News
by jbellis 4 days ago
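Because generating each new token needs a KV cache whose size is proportional to the context length. Without the cache you'd have to recompute the keys and values for every earlier token at every step, which is quadratic. So compressing the KV cache lets you handle longer contexts in the same amount of VRAM.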
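To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes plain multi-head attention where every layer stores one key and one value vector per KV head per token. The default dimensions (32 layers, 32 KV heads, head size 128, 2 bytes per element) are illustrative values I picked for a 7B-class fp16 model, not anything from the thread, and `kv_cache_bytes` / `max_context` are hypothetical helper names.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size in bytes for one sequence.

    The leading 2 accounts for storing both keys and values.
    All default dimensions are assumed, illustrative values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


def max_context(budget_bytes, compression=1.0, **dims):
    """Longest context that fits in budget_bytes of VRAM set aside for the cache.

    compression is a hypothetical ratio: 4.0 means the cache shrinks to 1/4 size.
    """
    per_token = kv_cache_bytes(1, **dims) / compression
    return int(budget_bytes // per_token)


if __name__ == "__main__":
    gib = 1024 ** 3
    print(kv_cache_bytes(4096) / gib)            # cache size in GiB at 4096 tokens
    print(max_context(2 * gib))                  # tokens that fit in 2 GiB, uncompressed
    print(max_context(2 * gib, compression=4.0)) # same budget with a 4x smaller cache
```

The cache grows linearly with sequence length, so under these assumptions any compression ratio translates directly into the same multiple of extra context for a fixed VRAM budget.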