| HN Mirror



	by 2001zhaozhao 49 days ago
	This is 128B dense, though. The K/V cache at long context is going to be massive.

2 comments

Havoc 49 days ago

Don’t think KV size correlates to dense/MoE.


zozbot234 49 days ago

KV size correlates with attention parameters which are a subset of active parameters. So a typical MoE model will have way lower KV size than a dense model of equal total parameter count.
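
To put rough numbers on that: for standard multi-head or grouped-query attention, the cache size depends only on the attention shape (layers, KV heads, head dimension), the context length, and the element width, not on how many expert parameters sit in the feed-forward blocks. A minimal sketch follows; both configs are hypothetical, made up purely for illustration, and do not describe any real model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Approximate KV cache size for standard MHA/GQA attention.

    The leading factor of 2 covers one key tensor and one value tensor
    per layer. bytes_per_elem=2 assumes an fp16/bf16 cache.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical attention shapes (assumptions, not real model configs).
# A dense model needs many wide layers to reach a given parameter count;
# an MoE model of similar total size puts most parameters in experts and
# can get by with a smaller attention stack.
dense_cfg = dict(num_layers=88, num_kv_heads=8, head_dim=128)
moe_cfg = dict(num_layers=48, num_kv_heads=8, head_dim=128)

seq_len = 128 * 1024  # 131,072 tokens of context

GIB = 1024 ** 3
dense_gib = kv_cache_bytes(seq_len=seq_len, **dense_cfg) / GIB
moe_gib = kv_cache_bytes(seq_len=seq_len, **moe_cfg) / GIB

print(f"dense: {dense_gib:.1f} GiB, moe: {moe_gib:.1f} GiB")
```

With these made-up shapes the difference comes entirely from the layer count; expert count never enters the formula. Architectures that compress or share the cache (for example latent-attention variants) would need a different formula.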


syntaxing 49 days ago

With TurboQuant, you would reduce it by over 6x.
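
As a back-of-the-envelope check on what a 6x reduction implies: the sketch below assumes a 16-bit baseline cache and ignores any per-block scale or metadata overhead. The 2.5-bit figure is an illustrative assumption chosen to land above 6x, not a statement about how any particular quantizer is configured.

```python
def compression_ratio(baseline_bits, quantized_bits):
    """How many times smaller the cache gets, ignoring metadata overhead."""
    return baseline_bits / quantized_bits


def max_bits_for_ratio(baseline_bits, target_ratio):
    """Largest average bits per element that still achieves target_ratio."""
    return baseline_bits / target_ratio


# Assumed 16-bit (fp16/bf16) baseline cache.
print(max_bits_for_ratio(16, 6))    # average bits per element allowed for 6x
print(compression_ratio(16, 2.5))   # ratio at an assumed 2.5 bits per element
```

In other words, getting past 6x from a 16-bit cache requires averaging under about 2.67 bits per cached element.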
