by WhitneyLand
512 days ago
Can you share at a high level how you run this model? We know it’s 671B params with each MOE node at 37B… If the GPUs have say, 140GB for an H200, then do you just load up as many nodes as will fit into a GPU? How much do interconnects hurt performance vs being able to load the model into a single GPU?
There are two ways we can run it:
- 8xH200 GPU == 8x141GB == 1128 GB VRAM
- 16xH100 GPU == 16x80GB == 1280 GB VRAM
Within a single node (up to 8 GPUs) you don't see any meaningful hit from GPU-to-GPU communication.
More than that (e.g. 16xH100) requires multi-node inference, which very few places have solved at a production-ready level. Solving it is a big deal, though, because there are way more H100s out there than H200s.
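
For anyone who wants to check the arithmetic, here is a rough sketch of the memory math for the two setups above. The GPU counts, per-GPU memory, and the 671B parameter count come from this thread; the bytes-per-parameter figures (1 for FP8, 2 for BF16) are my assumption, since the thread doesn't say what precision the weights are served in. The function names are made up for illustration, and the headroom number ignores how KV cache and activations actually get split across GPUs.

```python
# Back-of-the-envelope VRAM check for the two configurations in the thread.
# Uses decimal GB throughout (1 GB = 1e9 bytes).

PARAMS_BILLION = 671  # total parameter count quoted in the question

# Assumption (not stated in the thread): storage cost per parameter.
BYTES_PER_PARAM = {"fp8": 1, "bf16": 2}

# (gpu_count, gb_per_gpu) for each setup listed in the reply.
CONFIGS = {
    "8xH200": (8, 141),
    "16xH100": (16, 80),
}


def total_vram_gb(gpu_count, gb_per_gpu):
    """Aggregate VRAM across all GPUs in the setup."""
    return gpu_count * gb_per_gpu


def weights_gb(params_billion, bytes_per_param):
    """Size of the weights alone: 1e9 params * N bytes = N GB per billion."""
    return params_billion * bytes_per_param


def headroom_gb(config, precision):
    """VRAM left after loading weights (for KV cache, activations, etc.).

    A negative result means the weights alone do not fit.
    """
    gpu_count, gb_per_gpu = CONFIGS[config]
    used = weights_gb(PARAMS_BILLION, BYTES_PER_PARAM[precision])
    return total_vram_gb(gpu_count, gb_per_gpu) - used


if __name__ == "__main__":
    for name, (count, per_gpu) in CONFIGS.items():
        total = total_vram_gb(count, per_gpu)
        for precision in BYTES_PER_PARAM:
            spare = headroom_gb(name, precision)
            print(f"{name} ({total} GB total), {precision}: headroom {spare} GB")
```

Under those assumptions, one-byte weights fit in either setup with a few hundred GB to spare, while two-byte weights would not fit in either, which is one way to see why the total VRAM per deployment matters more than the size of any single GPU.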