by chaos_emergent
739 days ago
When training a large model, each forward pass + backprop for a single model instance is spread across multiple InfiniBand-connected nodes, so if one node goes down it takes that whole logical unit of nodes down with it. For reference, GPT-4 was rumored to be around 1.7T parameters, and by some back-of-the-envelope math[1], that's roughly 500-700 H100 GPUs per model instance, which means you need a multiple of that for any training parallelism whatsoever.

[1] back-of-the-envelope math:
1.7T params * 4 bytes/param = 6.8 TB;
3-4x that for activations + gradients = up to 27.2 TB (taking the 4x end);
27.2 TB / (80 GB per H100) = 340 H100s;
1.5-2x conservative multiplier for not fully using node resources + memory overhead in the machine = ~500-700 H100s. Truly insane numbers.
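
If you want to poke at the estimate yourself, here's a minimal sketch of the same arithmetic. The function name and parameter names are mine (hypothetical), and the inputs are just the rumored/assumed figures from above: 1.7T params, 4 bytes each, a 4x multiplier for activations + gradients, and 80 GB per H100. It's a memory-only estimate, not a statement about how any real training run was laid out.

```python
import math

def h100s_per_instance(
    params: int,
    bytes_per_param: int = 4,        # assumption: 4 bytes per parameter
    state_multiplier: float = 4.0,   # assumption: upper end of the 3-4x for activations + gradients
    overhead: float = 1.0,           # assumption: 1.5-2x for unused node resources + memory overhead
    gpu_mem_bytes: int = 80 * 10**9, # 80 GB per H100
) -> int:
    """Rough count of GPUs needed to hold one model instance in memory."""
    weights_bytes = params * bytes_per_param
    total_bytes = weights_bytes * state_multiplier * overhead
    return math.ceil(total_bytes / gpu_mem_bytes)

PARAMS = 1_700_000_000_000  # rumored 1.7T parameters

base = h100s_per_instance(PARAMS)                # no overhead multiplier
low = h100s_per_instance(PARAMS, overhead=1.5)   # 1.5x overhead
high = h100s_per_instance(PARAMS, overhead=2.0)  # 2x overhead
print(base, low, high)
```

Swapping in the 3x end of the multiplier or a different bytes-per-param figure shifts the result a lot, which is kind of the point: the order of magnitude is hundreds of GPUs per instance either way.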