Hacker News
by
2001zhaozhao
49 days ago
This is 128B dense, though. The K/V cache on long context is going to be massive.
2 comments
Havoc
49 days ago
I don't think KV size correlates with dense vs. MoE.
zozbot234
49 days ago
KV size correlates with attention parameters, which are a subset of active parameters. So a typical MoE model will have a much lower KV size than a dense model of equal total parameter count.
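To put rough numbers on that, here is a minimal back-of-the-envelope sketch. It assumes a plain attention stack (MHA or GQA) with an fp16 cache, and both configurations are hypothetical, picked only for illustration and not taken from any real model. The function name `kv_cache_bytes` is mine. Note that neither total parameter count nor expert count appears anywhere in the formula; only the attention shape and the context length do.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Approximate KV cache size for a standard MHA/GQA attention stack.

    Assumes every layer caches full K and V tensors; architectures that
    compress or share the cache are not modeled here.
    """
    # Factor of 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch


# Hypothetical configs at 128K context (illustrative only, not real models).
# A deep dense model versus a shallower MoE with the same KV head layout.
dense_like = kv_cache_bytes(n_layers=88, n_kv_heads=8, head_dim=128, seq_len=131072)
moe_like = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, seq_len=131072)

print(f"dense-like: {dense_like / 2**30:.1f} GiB")  # 44.0 GiB
print(f"MoE-like:   {moe_like / 2**30:.1f} GiB")    # 24.0 GiB
```

The gap here comes entirely from layer count and KV head layout. Two models with the same attention shape would have the same cache size regardless of whether their feed-forward blocks are dense or MoE.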
syntaxing
49 days ago
With TurboQuant, you would reduce it by over 6x.