by jasonwatkinspdx 649 days ago
It's not just about the link speeds, it's about the topologies used.

Google-style infrastructure uses aggregation trees. This works well for fan-out/fan-in communication patterns, but has limited bisection bandwidth at the core/top of the tree. This can be mitigated with Clos networks / fat trees, but in practice no one goes for full bisection bandwidth on these systems, as the cost and complexity aren't justified.

HPC machines typically use torus topology variants. This allows 2D and 3D grid-style computations to be mapped directly onto the system with nearly full bisection bandwidth. Each grid element can communicate directly with its neighbors on every iteration, without going through intermediate switches.
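To make the neighbor mapping concrete, here is a minimal sketch of which nodes a given node talks to directly on a torus. The function name `torus_neighbors` and the 4x4x4 size are made up for illustration and don't describe any particular machine.

```python
def torus_neighbors(coord, dims):
    """Return the direct neighbors of `coord` on a torus with side lengths `dims`.

    Each axis contributes two neighbors (one step back, one step forward),
    and the modulo gives the wraparound links that turn a mesh into a torus.
    Assumes every side length is at least 3, so the two neighbors on an
    axis are distinct nodes.
    """
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size
            neighbors.append(tuple(n))
    return neighbors


# Corner node of a hypothetical 4x4x4 torus: 6 neighbors, 3 of them via wraparound.
print(torus_neighbors((0, 0, 0), (4, 4, 4)))
```

A 3D stencil computation that exchanges halos with its six grid neighbors each iteration maps onto exactly these links.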

Reliability is handled quite a bit differently too. Google-style infrastructure does this with elaborations of the MapReduce style: spot the stragglers or failures, then reallocate that work via software. HPC infrastructure puts more emphasis on hardware reliability.

You're right that F32 and F64 performance are more important on HPC, while Google apps are mostly integer only, and ML apps can use lower precision formats like F16.


Almost no modern systems are running Torus these days - at least not at the node level. The backbone links are still occasionally designed that way, although Dragonfly+ or similar is much more common and maps better onto modern switch silicon.

You're spot on that the bandwidth available in these machines hugely outstrips that in common cloud cluster rack-scale designs, although full bisection bandwidth hasn't been a design goal for larger systems for a number of years.

LambdaLabs' GPU cluster provides internode bandwidth of 3.2Tbps: I personally verified it in a cluster of 64 nodes (8xH100 servers), and they claim it holds for clusters of up to 5k GPUs. What is the internode bandwidth of Frontier? Someone claimed it's 200Gbps, which, if true, would be a huge bottleneck for some ML models.

Frontier is 4x 200Gbps links per node into the interconnect. The interconnect is designed for 540TB/s of bisection bandwidth. <https://icl.utk.edu/files/publications/2022/icl-utk-1570-202...>

Bisection bandwidth is the metric these systems will cite, and it determines how the largest simulations will behave. Inter-node bandwidth isn't a direct comparison, and can be higher at modest node counts as long as you're within a single switch. I haven't seen a network diagram for LambdaLabs, but it looks like they're building off 200Gbps InfiniBand once you get outside of NVLink. So they'll have higher bandwidth within each NVLink island, but the performance will drop once you need to cross islands.

I thought NVLink is only for communication between GPUs within a single node, no? I don't know how large their switches are, but I verified that within a 64-node cluster I got the full advertised 3.2Tbps bandwidth. So that's 4x as fast as 4x200Gbps, but 800Gbps is probably not a bottleneck for any real-world workload.

It's 200 Gbps per port, per direction. That's the same as the Nvidia interconnect LambdaLabs uses.
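
For anyone following the arithmetic, here's a quick sketch using only the figures quoted in this thread (4x 200Gbps links per Frontier node, 3.2Tbps per LambdaLabs node). The variable names are mine, and the numbers are per direction as stated above; none of this is a measurement.

```python
# Figures as quoted in the thread, in gigabits per second, per direction.
GBPS_PER_LINK = 200
FRONTIER_LINKS_PER_NODE = 4
LAMBDA_NODE_GBPS = 3200  # 3.2 Tbps


def gbps_to_gigabytes_per_s(gbps):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


frontier_node_gbps = GBPS_PER_LINK * FRONTIER_LINKS_PER_NODE
ratio = LAMBDA_NODE_GBPS / frontier_node_gbps

print(f"Frontier per node:   {frontier_node_gbps} Gbps "
      f"({gbps_to_gigabytes_per_s(frontier_node_gbps):.0f} GB/s)")
print(f"LambdaLabs per node: {LAMBDA_NODE_GBPS} Gbps "
      f"({gbps_to_gigabytes_per_s(LAMBDA_NODE_GBPS):.0f} GB/s)")
print(f"Ratio: {ratio:.1f}x")
```

This only compares per-node injection bandwidth. It says nothing about bisection bandwidth, which depends on the topology between the switches.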