	by zozbot234 49 days ago
	KV cache size correlates with the attention parameters, which are a subset of the active parameters. So a typical MoE model will have a much smaller KV cache than a dense model of equal total parameter count.
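To put rough numbers on this, here is a minimal sketch using the standard per-token estimate for multi-head or grouped-query attention: 2 (keys and values) x layers x KV heads x head dimension x bytes per element. Both configurations below are hypothetical, picked only for illustration and not taken from any real model. The expert count never enters the formula, which is the point: experts add total parameters without adding KV cache.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a single sequence.

    The factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 cache entries (an assumption).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configs with similar total parameter counts.
# The MoE model spends most of its parameters on expert FFNs,
# so its attention stack (and therefore its KV cache) is smaller.
dense = dict(n_layers=80, n_kv_heads=8, head_dim=128)
moe = dict(n_layers=32, n_kv_heads=8, head_dim=128)

seq_len = 4096
for name, cfg in (("dense", dense), ("moe", moe)):
    size_gib = kv_cache_bytes(seq_len=seq_len, **cfg) / 2**30
    print(f"{name}: {size_gib:.2f} GiB at {seq_len} tokens")
```

Architectures that compress the cache (for example latent-attention variants) need a different formula, so treat this as a back-of-the-envelope estimate only.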