by phire
434 days ago
It's not even per token. The routing happens once per layer, with the same token bouncing between experts from one layer to the next. It's more of a performance optimization than anything else, improving memory liquidity. Except it's not an optimization for running the model locally (where you only run a single query at a time, and it would be nice to keep the weights on disk until they are relevant). It's a performance optimization for large deployments with thousands of GPUs answering tens of thousands of queries per second. They put thousands of queries into a single batch and run them in parallel. After each layer, the queries are re-routed to the GPU holding the correct subset of weights. Individual queries will bounce across dozens of GPUs per token, distributing load. Even though the name "expert" implies they should be experts in a given topic, that's really not true. During training, the routing is optimized to distribute the load evenly, nothing else.
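To make the per-layer routing concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the random router weights, the sizes, the stand-in for the expert computation, and the helper names (`make_router`, `route`, `expert_path`, `load_balance_loss`) are all hypothetical. The balancing term is written in the style of the Switch Transformer auxiliary loss; I'm not claiming any particular production model uses exactly this form.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_router(rng, num_experts, dim):
    # One weight row per expert. Random values stand in for a trained router.
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]


def route(router, hidden, top_k):
    """Pick the top_k experts for one token at one layer."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router]
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:top_k], probs


def expert_path(routers, hidden, top_k=1):
    """Experts visited by a single token: one routing decision per layer."""
    path = []
    for router in routers:
        chosen, _ = route(router, hidden, top_k)
        path.append(chosen)
        # Toy stand-in for the expert MLP so the next layer sees a new input.
        hidden = [math.tanh(h + 0.1 * chosen[0]) for h in hidden]
    return path


def load_balance_loss(assignments, probs_per_token, num_experts):
    """Switch-style balancing term: N * sum_i f_i * P_i.

    f_i is the fraction of tokens sent to expert i (top-1 assignment) and
    P_i is the mean router probability for expert i. The value is 1.0 when
    both are uniform and grows as the load concentrates on few experts.
    """
    n_tokens = len(assignments)
    f = [assignments.count(i) / n_tokens for i in range(num_experts)]
    p = [sum(tok[i] for tok in probs_per_token) / n_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


if __name__ == "__main__":
    rng = random.Random(0)
    routers = [make_router(rng, num_experts=8, dim=16) for _ in range(6)]
    token = [rng.gauss(0, 1) for _ in range(16)]
    # One list of expert ids per layer, all for the same single token.
    print(expert_path(routers, token, top_k=2))
```

The point is that `expert_path` returns one independent choice per layer for a single token, so in a deployment where each expert lives on a different GPU, that one token can visit a different device at every layer. Nothing in `load_balance_loss` rewards topical specialization; it only penalizes uneven load.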
While current MoE implementations are tuned for load-balancing over large pools of GPUs, there is nothing stopping you from tuning them to only switch experts once or twice per token, and ideally keep the same weights across multiple tokens.
Well, nothing is stopping you, but there is the question of whether it will actually produce a worthwhile model.