by guntars 502 days ago
Since it's a MoE model with 37B active params, I imagined you wouldn't even need all of that RAM to keep the whole model in memory, just the active bits.
1 comment

The active bits can change with every token. You need the whole model in memory even though, for any single token, only a subset of those weights is actually used in the computation. The memory efficiency shows up when you run multiple sessions in parallel: they all share the same resident weights, while different tokens exercise different experts.
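
To make that concrete, here is a rough toy sketch in Python. Everything in it is an assumption for illustration: the router is just random scores (a real MoE router is a learned layer), the function names are made up, and 64 experts with top-4 routing are arbitrary numbers, not this model's actual configuration. It only shows how the set of experts touched grows as more tokens are processed.

```python
import random


def route_token(num_experts: int, top_k: int, rng: random.Random) -> list[int]:
    """Toy router (hypothetical): score every expert at random, keep the top_k."""
    scores = [rng.random() for _ in range(num_experts)]
    ranked = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)
    return ranked[:top_k]


def experts_touched(num_tokens: int, num_experts: int, top_k: int, seed: int = 0) -> set[int]:
    """Return the union of expert indices selected across num_tokens tokens."""
    rng = random.Random(seed)
    touched: set[int] = set()
    for _ in range(num_tokens):
        touched.update(route_token(num_experts, top_k, rng))
    return touched


if __name__ == "__main__":
    # Illustrative sizes only, not taken from any real model.
    for n in (1, 10, 100, 1000):
        used = experts_touched(n, num_experts=64, top_k=4)
        print(f"{n:>5} tokens -> {len(used)} of 64 experts touched")
```

A single token only ever touches top_k experts, but you can't know in advance which ones, and over a long enough generation pretty much all of them get hit. That is why the full set of weights has to stay resident (or you pay for swapping experts in and out per token).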