HN Mirror


	by jbellis 51 days ago
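	Because generating each token during inference needs a KV cache whose size is proportional to the context length; without it you would have to recompute keys and values for the whole context at every step, which is quadratic. So compressing the KV cache lets you handle longer contexts in the same amount of VRAM.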
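
To put rough numbers on that, here is a back-of-the-envelope sketch. The model dimensions (32 layers, 32 KV heads, head size 128, fp16) and the 8 GiB budget are illustrative assumptions, not figures from this thread, and the helper names `kv_cache_bytes` and `max_context` are made up for the example. It only counts the cache itself, ignoring weights and activations.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_elem=2.0):
    """Approximate KV cache size for one sequence.

    The leading factor of 2 covers keys plus values. All defaults are
    assumed, illustrative model dimensions (fp16 = 2 bytes per element).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def max_context(vram_budget_bytes, **model_kwargs):
    """Longest context whose KV cache fits in the given byte budget."""
    per_token = kv_cache_bytes(1, **model_kwargs)
    return int(vram_budget_bytes // per_token)


if __name__ == "__main__":
    GiB = 1024 ** 3
    budget = 8 * GiB  # assumed VRAM left over for the cache

    # Cache grows linearly with context length.
    for n in (1024, 4096, 16384):
        print(n, "tokens:", kv_cache_bytes(n) / GiB, "GiB")

    # Same budget, uncompressed fp16 vs. a hypothetical 4x compression
    # (modelled here simply as 0.5 bytes per element).
    print("fp16 max context:", max_context(budget))
    print("4x compressed max context:", max_context(budget, bytes_per_elem=0.5))
```

Under these assumptions the cache costs 0.5 MiB per token, so a fixed budget holds four times as many tokens once the cache is compressed 4x. Real compression schemes carry their own overhead, so treat the ratio as an upper bound.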