Hacker News new | ask | show | jobs
by l9o 176 days ago
RAM requirements stay the same. You need all 358B parameters loaded in memory, as which experts activate depends on each token dynamically. The benefit is compute: only ~32B params participate per forward pass, so you get much faster tok/s than a dense 358B would give you.
1 comments

The benefit is also RAM bandwidth. That probably adds to the confusion, but it matters a lot for decode. But yes, RAM capacity requirements stay the same.