|
|
|
|
|
by loeg
740 days ago
|
|
This is really unique to the AI training clusters for reasons I'm not super clear on. Most other types of horizontally scaled workloads can sort of tolerate a slightly underperforming host, or hosts going bad every so often, with little P99/P99.9 impact. For some reason, AI training workloads really cannot. |
|
[1] back-of-the-hand-math: 1.7T * 4 bytes = 6.8 TB; 3-4x that for activation + gradients = 27.2 TB; 27.2TB / (80GB / H100) = 349 H100s; 1.5-2x conservative multiplier accounting for not fully using node resources + memory overhead in the machine = ~500-700 H100s.
truly insane numbers.