by 1R053 589 days ago
the paper with details: https://arxiv.org/pdf/2411.02265

They use

- 16 experts, of which one is activated per token

- 1 shared expert that is always active

In summary, that makes around 52B active parameters per token instead of the 405B of Llama 3.1.
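
To make the routing described above concrete, here is a toy sketch of one mixture-of-experts layer with 16 routed experts (top-1 per token) plus 1 shared expert that always runs. It is not the paper's implementation: the hidden size `DIM`, the single-matrix "experts", the softmax router, and the way the gate weight scales the routed output are all assumptions made for illustration.

```python
import math
import random

NUM_ROUTED_EXPERTS = 16  # from the comment: 16 routed experts
TOP_K = 1                # from the comment: one routed expert per token
DIM = 8                  # toy hidden size (assumption, not from the paper)


def make_matrix(rng, rows, cols):
    """Random rows x cols matrix; a toy stand-in for a real expert FFN."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router, routed_experts, shared_expert, top_k=TOP_K):
    """Return (output, chosen routed expert indices) for one token vector x."""
    # One gate probability per routed expert (assumed softmax router).
    gates = softmax(matvec(router, x))
    # Keep only the top_k highest-scoring routed experts.
    order = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    chosen = order[:top_k]
    # The shared expert runs for every token, regardless of routing.
    out = matvec(shared_expert, x)
    # Add the gate-weighted output of each chosen routed expert.
    for i in chosen:
        y = matvec(routed_experts[i], x)
        out = [o + gates[i] * yi for o, yi in zip(out, y)]
    return out, chosen


rng = random.Random(0)
router = make_matrix(rng, NUM_ROUTED_EXPERTS, DIM)
routed_experts = [make_matrix(rng, DIM, DIM) for _ in range(NUM_ROUTED_EXPERTS)]
shared_expert = make_matrix(rng, DIM, DIM)

token = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
out, chosen = moe_forward(token, router, routed_experts, shared_expert)

# Experts actually evaluated for this token: 1 shared + TOP_K routed.
experts_run = 1 + len(chosen)
experts_total = 1 + NUM_ROUTED_EXPERTS
print(f"routed expert chosen: {chosen}, experts run: {experts_run} of {experts_total}")
```

Only 2 of the 17 expert blocks are evaluated per token in this sketch, which is the mechanism behind the gap between total and active parameters. The exact parameter split also depends on attention and embedding weights, which the toy layer leaves out.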