by tomp
434 days ago
> individual tokens are routed to different experts

That was AFAIK (not an expert! lol) the traditional approach, but judging by the chart in the LLaMa4 blog post, they're now interleaving MoE layers and dense attention layers. So I guess this means that even a single token could be routed to a different expert at every single MoE layer!
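To make that guess concrete, here's a toy sketch of per-layer routing. Everything in it is an assumption for illustration: the names (`make_router`, `route`, `expert_path`), the random router weights, top-1 selection, and the layer and expert counts are all made up, and none of it is taken from the LLaMa4 code. The only point is that each MoE layer has its own router, so the same token can land on a different expert at each layer.

```python
import random

def make_router(num_experts, dim, rng):
    # Hypothetical toy router: one random weight vector per expert.
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]

def route(token, router):
    # Top-1 routing: score each expert by dot product, pick the highest.
    scores = [sum(w * x for w, x in zip(weights, token)) for weights in router]
    return max(range(len(scores)), key=scores.__getitem__)

def expert_path(token, routers):
    # Each MoE layer has its own router, so the choice is made per layer.
    # Simplification: the token vector is not transformed between layers.
    return [route(token, router) for router in routers]

rng = random.Random(0)
NUM_LAYERS, NUM_EXPERTS, DIM = 4, 8, 16  # illustrative sizes only
routers = [make_router(NUM_EXPERTS, DIM, rng) for _ in range(NUM_LAYERS)]
token = [rng.gauss(0, 1) for _ in range(DIM)]

path = expert_path(token, routers)
print(path)  # one expert index per MoE layer for this single token
```

In a real model the token's hidden state also changes between layers (attention, the expert output itself), which is one more reason the per-layer choices don't have to agree.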