Hacker News
by samus 229 days ago
This is a Mixture of Experts model with only 3B activated parameters. But I agree that for the intended usage scenario VRAM for the KV cache is the real limitation.
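
To put rough numbers on the KV cache point, here is a back-of-the-envelope sketch. It assumes a standard transformer where every layer stores one key and one value vector per KV head per token. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the actual configuration of the model being discussed. Note that the number of activated expert parameters does not appear anywhere in the formula: in a typical MoE transformer the experts replace the feed-forward blocks, while the cache comes from the attention layers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a standard transformer.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to fp16/bf16 cache entries.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration, for illustration only (not the real model's).
cfg = dict(num_layers=48, num_kv_heads=4, head_dim=128)

for ctx in (8_192, 131_072):
    gib = kv_cache_bytes(seq_len=ctx, **cfg) / 2**30
    print(f"{ctx:>7} tokens: {gib:.2f} GiB")
```

With these made-up numbers the cache grows linearly with context length, so a 16x longer context needs 16x the VRAM no matter how few parameters are active per token.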