by phire
499 days ago
My understanding is that with MoE (Mixture of Experts), you can and should shard the model horizontally. The whole model is 600GB, but only 37GB is active during the evaluation of any single output token. So you can load a different subset of the experts onto each 80GB GPU, sharding the model across something like 32 different GPUs (or can you get away with fewer? I wouldn't be surprised if they can run inference on 8x H800 GPUs). Some parameters are common to every token; others belong to individual experts. Queries can be dynamically routed between GPUs, potentially bouncing between GPUs as often as once per output token, depending on which experts they need to activate. Though I suspect it's normal to stick with one MoE subset for several output tokens. This has a secondary benefit: as long as the routing distribution is random, queries should be roughly load balanced across all GPUs.
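To put rough numbers on this, here is a minimal Python sketch under stated assumptions. The 600GB and 32-GPU figures come from the comment above, and 80GB is the H800's memory. The expert count (256), the top-k of 8, the contiguous expert-to-GPU layout, and the uniformly random router are all assumptions for illustration. A real router is learned and runs per layer, so treat this as a toy model of the load-balancing argument and not as how any particular system works.

```python
import math
import random
from collections import Counter

MODEL_GB = 600      # total weight size, figure from the comment
GPU_GB = 80         # memory per H800
NUM_GPUS = 32       # sharding width floated in the comment
NUM_EXPERTS = 256   # assumption: number of routed experts
TOP_K = 8           # assumption: experts activated per token


def min_gpus_for_weights(model_gb, gpu_gb):
    """Fewest GPUs whose combined memory can hold the weights alone."""
    return math.ceil(model_gb / gpu_gb)


def gpu_for_expert(expert_id, num_experts=NUM_EXPERTS, num_gpus=NUM_GPUS):
    """Assumed layout: experts are split into contiguous, equal blocks per GPU."""
    return expert_id * num_gpus // num_experts


def simulate(num_tokens, seed=0):
    """Count expert activations per GPU under a uniformly random router."""
    rng = random.Random(seed)
    load = Counter()
    for _ in range(num_tokens):
        # Pick TOP_K distinct experts for this token.
        for expert_id in rng.sample(range(NUM_EXPERTS), TOP_K):
            load[gpu_for_expert(expert_id)] += 1
    return load


if __name__ == "__main__":
    print("GPUs needed for weights only:", min_gpus_for_weights(MODEL_GB, GPU_GB))
    load = simulate(2000)
    print("busiest GPU:", max(load.values()), "quietest GPU:", min(load.values()))
```

Under these assumptions 8 GPUs is the floor for the weights alone (8 x 80GB = 640GB), which leaves little headroom for the KV cache and activations, so a wider shard like 32 GPUs is plausible. The simulation also only shows balance when routing is uniform; a skewed router would concentrate load on the GPUs holding popular experts.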