by vlovich123
115 days ago
MoE is not well suited to paging because each token is routed to what is essentially a random expert. It only improves throughput because it reduces the memory bandwidth needed to generate a token: 1/n of the weights are accessed per token (but a different 1/n on each loop). Shrinking the weights, sure, but I've seen nothing that indicates you can just page weights in and out of an MoE model without cratering your performance, just as you would with a non-MoE model.
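To put a rough number on the "random expert per token" point, here's a toy sketch. The assumptions are mine, not measurements: routing is modeled as uniformly random with one expert per token, and `lru_hit_rate`, `num_experts`, and `cache_slots` are made-up names for illustration. Real routers pick several experts per layer and are not perfectly uniform, so treat this as a pessimistic back-of-the-envelope model rather than a benchmark.

```python
import random
from collections import OrderedDict


def lru_hit_rate(num_experts, cache_slots, num_tokens, seed=0):
    """Fraction of expert lookups served by an LRU cache of resident experts.

    Assumption: each token picks one expert uniformly at random.
    """
    if cache_slots < 1:
        raise ValueError("cache_slots must be at least 1")
    rng = random.Random(seed)
    cache = OrderedDict()  # keys are resident expert ids, oldest first
    hits = 0
    for _ in range(num_tokens):
        expert = rng.randrange(num_experts)
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)  # mark as most recently used
        else:
            if len(cache) >= cache_slots:
                cache.popitem(last=False)  # evict the least recently used expert
            cache[expert] = None  # "page in" the missing expert
    return hits / num_tokens


if __name__ == "__main__":
    for slots in (1, 2, 4, 8):
        print(slots, lru_hit_rate(num_experts=8, cache_slots=slots, num_tokens=100_000))
```

Under uniform routing the hit rate comes out at roughly `cache_slots / num_experts`, so keeping a quarter of the experts resident means about three out of four tokens wait on a page-in. Any locality in real routing would improve on that, which is the thing I haven't seen evidence for.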