by philipkiely
510 days ago
Yeah, so MoE doesn't really come into play for production serving: once you are batching your requests, a large enough batch hits every expert, so you have to think about running the model as a whole.

There are two ways we can run it:

- 8xH200 GPU == 8x141GB == 1128 GB VRAM
- 16xH100 GPU == 16x80GB == 1280 GB VRAM

Within a single node (up to 8 GPUs) you don't see any meaningful hit from GPU-to-GPU communication. More than that (e.g. 16xH100) requires multi-node inference, which very few places have solved at a production-ready level, but it's a big deal because there are way more H100s out there than H200s.
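To make the arithmetic behind those two configurations explicit, here is a small sketch. The helper names (`total_vram_gb`, `nodes_needed`) are made up for illustration, and the 8-GPUs-per-node limit is just the figure quoted above, not a property of any particular serving stack.

```python
GPUS_PER_NODE = 8  # assumption taken from the comment: one node holds up to 8 GPUs


def total_vram_gb(num_gpus: int, gb_per_gpu: int) -> int:
    """Aggregate VRAM across all GPUs in the deployment."""
    return num_gpus * gb_per_gpu


def nodes_needed(num_gpus: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Ceiling division: how many nodes it takes to host num_gpus."""
    return -(-num_gpus // gpus_per_node)


# (GPU count, GB of VRAM per GPU) for the two options discussed
configs = {"8xH200": (8, 141), "16xH100": (16, 80)}

for name, (num_gpus, gb_per_gpu) in configs.items():
    vram = total_vram_gb(num_gpus, gb_per_gpu)
    nodes = nodes_needed(num_gpus)
    kind = "single-node" if nodes == 1 else "multi-node"
    print(f"{name}: {vram} GB VRAM across {nodes} node(s), {kind}")
```

The 16xH100 option has more total memory, but it is the only one of the two that crosses the node boundary.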
|
In their V3 paper DeepSeek talk about having redundant copies of some "experts" when deploying with expert parallelism in order to account for the different amounts of load they get. I imagine it only makes a difference at very high loads, but I thought it was a pretty interesting technique.
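A toy sketch of the idea, not DeepSeek's actual algorithm: given an observed load per expert and a budget of extra expert slots, greedily hand each extra copy to whichever expert currently has the highest load per replica. The function name `assign_replicas` and the load numbers are hypothetical, and the sketch assumes traffic splits evenly across an expert's copies.

```python
import heapq


def assign_replicas(expert_load, num_redundant):
    """Return the number of copies per expert after handing out redundant slots.

    Greedy heuristic: each extra slot goes to the expert whose per-replica
    load is currently the highest.
    """
    if not expert_load:
        return []
    replicas = [1] * len(expert_load)
    # heapq is a min-heap, so store negated per-replica load to pop the max.
    heap = [(-load, i) for i, load in enumerate(expert_load)]
    heapq.heapify(heap)
    for _ in range(num_redundant):
        _, i = heapq.heappop(heap)
        replicas[i] += 1
        heapq.heappush(heap, (-expert_load[i] / replicas[i], i))
    return replicas


def max_per_replica_load(expert_load, replicas):
    """Worst-case load on any single expert copy."""
    return max(load / r for load, r in zip(expert_load, replicas))


# Hypothetical token counts routed to four experts
load = [100, 10, 10, 40]
replicas = assign_replicas(load, num_redundant=2)
print(replicas, max_per_replica_load(load, replicas))
```

With skewed load like this, the hot expert soaks up the extra copies and the worst-case load per copy drops, which fits the intuition that the technique matters most when some experts are much busier than others.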