by 1R053
589 days ago
The paper with details: https://arxiv.org/pdf/2411.02265

They use
- 16 experts, of which one is activated per token
- 1 shared expert that is always active

In summary, that makes around 52B active parameters per token, instead of the 405B of Llama 3.1.
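To make the routing concrete, here is a rough toy sketch of a layer with one always-on shared expert plus 16 routed experts, of which only the top-scoring one runs per token. Everything in it (`make_expert`, `moe_layer`, the tiny linear "experts", the dot-product router) is made up for illustration and is not the paper's implementation. Real MoE layers also usually scale the routed output by a gate weight, which is left out here.

```python
import random

NUM_ROUTED_EXPERTS = 16  # from the comment: 16 experts, one active per token


def make_expert(rng, dim):
    """Hypothetical toy expert: a single dim x dim linear map."""
    w = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

    return expert


def moe_layer(x, shared_expert, routed_experts, router_weights):
    """Run the shared expert plus the single best-scoring routed expert."""
    # Router score per routed expert: dot product of its row with the token.
    scores = [sum(r * v for r, v in zip(row, x)) for row in router_weights]
    # Top-1 routing: only this one routed expert is evaluated for the token.
    chosen = max(range(len(routed_experts)), key=lambda i: scores[i])
    shared_out = shared_expert(x)           # always active
    routed_out = routed_experts[chosen](x)  # 1 of 16
    out = [s + r for s, r in zip(shared_out, routed_out)]
    return out, chosen


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4
    shared = make_expert(rng, dim)
    routed = [make_expert(rng, dim) for _ in range(NUM_ROUTED_EXPERTS)]
    router = [[rng.uniform(-1, 1) for _ in range(dim)]
              for _ in range(NUM_ROUTED_EXPERTS)]
    out, chosen = moe_layer([1.0, 0.5, -0.5, 2.0], shared, routed, router)
    print("routed expert used:", chosen, "output:", out)
```

Per token only 2 of the 17 expert blocks do any work here, which is the basic reason the active parameter count is so much smaller than the total.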