|
|
|
|
|
by p1esk
644 days ago
|
|
Do you mean this supercomputer has slower internode links? What are its links? For example, xAI just brought up 100k GPU cluster, most likely with 800Gbps internode links, or maybe even double that. I think the main difference is in the target numerical precision: supercomputers such as this one focus on maximizing FP64 throughput, while GPU clusters used by OpenAI or xAI want to compute in 16 or even 8 bit precision (BF16 or FP8). |
|
Google style infrastructure uses aggregation trees. This works well for fan out fan back in communication patterns, but has limited bisection bandwidth at the core/top of the tree. This can be mitigated with clos networks / fat trees, but in practice no one goes for full bisection bandwidth on these systems as the cost and complexity aren't justified.
HPC machines typically use torus topology variants. This allows 2d and 3d grid style computations to be directly mapped onto the system with nearly full bisection bandwidth. Each smallest grid element can communicate directly with its neighbors each iteration, without going over intermediate switches.
Reliability is handled quite a bit different too. Google style infrastructure does this with elaborations of the map reduce style: spot the stranglers or failures, reallocate that work via software. HPC infrastructure puts more emphasis on hardware reliability.
You're right that F32 and F64 performance are more important on HPC, while Google apps are mostly integer only, and ML apps can use lower precision formats like F16.