
Yes, you’re right.

Llama 3.1, however, is not MoE; it is a dense model, so all params are active for every token.

For MoE it is tricky: each token only uses a subset of the params (the selected “experts”), but you don’t know in advance which ones, and the selection can differ from token to token. So you either keep all of them in memory or wait while the needed ones load from slower storage.
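To put rough numbers on that, here is a minimal sketch of the arithmetic. The function names and the example config (2B shared params, 8 experts of 5B each, 2 experts per token, 2 bytes per param) are made up for illustration and do not describe any real model.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, experts_per_token):
    """Return (total, active) parameter counts for a hypothetical MoE model.

    total  = everything that must be reachable (in memory or on slower storage)
    active = what a single token actually touches during its forward pass
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active


def weight_bytes(params, bytes_per_param=2):
    """Bytes needed to hold the weights, assuming 2 bytes per param (e.g. fp16/bf16)."""
    return params * bytes_per_param


# Hypothetical config, not a real model.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=5_000_000_000,
    num_experts=8,
    experts_per_token=2,
)

print(f"total params:  {total / 1e9:.1f}B -> {weight_bytes(total) / 1e9:.1f} GB to keep resident")
print(f"active params: {active / 1e9:.1f}B -> {weight_bytes(active) / 1e9:.1f} GB touched per token")
```

With these made-up numbers you compute with 12B params per token but still need room for 42B, because the next token may route to different experts. A dense model is the special case where `num_experts == experts_per_token`, so total and active are the same.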