Hacker News
by samus, 229 days ago
This is a Mixture of Experts model with only 3B activated parameters. But I agree that, for the intended usage scenario, VRAM for the KV cache is the real limitation.
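
To put rough numbers on the KV cache point, here is a back-of-the-envelope sketch. It assumes standard multi-head or grouped-query attention with an fp16 cache. The function name `kv_cache_bytes` and every config value below are hypothetical placeholders for illustration, not the actual configuration of the model under discussion. In a typical Mixture of Experts design the experts replace the feed-forward layers, so the small activated parameter count would not shrink the cache.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes for standard attention.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes an fp16/bf16 cache (an assumption, not a given).
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical config, chosen only to make the arithmetic concrete.
size = kv_cache_bytes(num_layers=48, num_kv_heads=4, head_dim=128,
                      seq_len=262_144)
print(f"{size / 2**30:.1f} GiB")  # prints "24.0 GiB"
```

With these made-up numbers the cache costs 96 KiB per token, so a single 256K-token context would need 24 GiB on its own, before counting any weights.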