by zozbot234 116 days ago
Should be active param size, not model size.

Yes, you’re right.

Llama 3.1, however, is not MoE, so all params are active.

For MoE it is tricky: each token only uses a subset of the params (the “experts” the router picks for it), but you don’t know in advance which ones. So you either keep them all in memory or wait while the selected ones load from slower storage, and the selection can be different for each token.
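
To put rough numbers on that, here is a minimal sketch of the total vs. active distinction. Everything in it is an assumption for illustration: the `MoEConfig` fields, the helper names, and the parameter counts are made up and do not describe any real model. It only counts weights at a fixed bytes-per-param and ignores KV cache and activations.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    """Hypothetical parameter breakdown, not taken from any real model."""
    shared_params: float      # attention, embeddings, etc.; used by every token
    params_per_expert: float  # one expert's weights, summed over all layers
    num_experts: int          # experts available to the router
    experts_per_token: int    # experts actually used per token (top-k)


def total_params(cfg: MoEConfig) -> float:
    # What has to live somewhere (RAM/VRAM or slower storage).
    return cfg.shared_params + cfg.num_experts * cfg.params_per_expert


def active_params(cfg: MoEConfig) -> float:
    # What a single token actually reads and multiplies through.
    return cfg.shared_params + cfg.experts_per_token * cfg.params_per_expert


def weight_memory_gb(n_params: float, bytes_per_param: float = 2.0) -> float:
    # 2 bytes per param assumes 16-bit weights; adjust for quantization.
    return n_params * bytes_per_param / 1e9


# Made-up MoE: 2B shared, 8 experts of 5B each, 2 experts used per token.
moe = MoEConfig(shared_params=2e9, params_per_expert=5e9,
                num_experts=8, experts_per_token=2)

# A dense model is the degenerate case: one "expert", always used.
dense = MoEConfig(shared_params=2e9, params_per_expert=5e9,
                  num_experts=1, experts_per_token=1)

if __name__ == "__main__":
    for name, cfg in [("moe", moe), ("dense", dense)]:
        print(name,
              "resident GB:", weight_memory_gb(total_params(cfg)),
              "read per token GB:", weight_memory_gb(active_params(cfg)))
```

In the dense case the two numbers come out the same, which is the Llama 3.1 situation. In the MoE case the per-token figure is much smaller than the resident one, but you only get to benefit from it if all the experts are already sitting in fast memory.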