Hacker News new | ask | show | jobs
by p1esk 644 days ago
Do you mean this supercomputer has slower internode links? What are its links? For example, xAI just brought up 100k GPU cluster, most likely with 800Gbps internode links, or maybe even double that.

I think the main difference is in the target numerical precision: supercomputers such as this one focus on maximizing FP64 throughput, while GPU clusters used by OpenAI or xAI want to compute in 16 or even 8 bit precision (BF16 or FP8).

2 comments

It's not just about the link speeds, it's about the topologies used.

Google style infrastructure uses aggregation trees. This works well for fan out fan back in communication patterns, but has limited bisection bandwidth at the core/top of the tree. This can be mitigated with clos networks / fat trees, but in practice no one goes for full bisection bandwidth on these systems as the cost and complexity aren't justified.

HPC machines typically use torus topology variants. This allows 2d and 3d grid style computations to be directly mapped onto the system with nearly full bisection bandwidth. Each smallest grid element can communicate directly with its neighbors each iteration, without going over intermediate switches.

Reliability is handled quite a bit different too. Google style infrastructure does this with elaborations of the map reduce style: spot the stranglers or failures, reallocate that work via software. HPC infrastructure puts more emphasis on hardware reliability.

You're right that F32 and F64 performance are more important on HPC, while Google apps are mostly integer only, and ML apps can use lower precision formats like F16.

Almost no modern systems are running Torus these days - at least not at the node level. The backbone links are still occasionally designed that way, although Dragonfly+ or similar is much more common and maps better onto modern switch silicon.

You're spot on that the bandwidth available in these machines hugely outstrips that in common cloud cluster rack-scale designs. Although full bisection bandwidth hasn't been a design goal for larger systems for a number of years.

LambdaLabs GPU cluster provides internode bandwidth of 3.2Tbps: I personally verified it in a cluster of 64 nodes (8xH100 servers) and they claim it holds for up to 5k GPU cluster. What is the internode bandwidth of Frontier? Someone claimed it's 200Gbps, which, if true, would be a huge bottleneck for some ML models.
Frontier is 4x 200Gbps links per node into the interconnect. The interconnect is designed for 540TB/s of bisection bandwidth. <https://icl.utk.edu/files/publications/2022/icl-utk-1570-202...>

Bisection bandwidth is the metric these systems will cite, and impacts how the largest simulations will behave. Inter-node bandwidth isn't a direct comparison, and can be higher at modest node counts as long as you're within a single switch. I haven't seen a network diagram for LambdaLabs, but it looks like they're building off 200Gbps Infiniband once you get outside of NVLink. So they'll have higher bandwidth within each NVLink island, but the performance will drop once you need to cross islands.

I thought NVLink is only for communication between GPUs within a single node, no? I don't know what the size of their switches are, but I verified that within a 64 node cluster I got the full advertised 3.2Tbps bandwidth. So that's 4x as fast as 4x200Gbps, but 800Gbps is probably not a bottleneck for any real world workload.
It's 200 Gbps per port, per direction. That's the same as the Nvidia interconnect lambdalabs uses.
Each node has 4 GPUs, and each of those has a dedicated network interface card capable of 200 Gbps each way. Data can move right from one GPU's memory to another. But it's not just bandwidth that allows the machine to run so well, it's a very low-latency network as well. Many science codes require very frequent synchronizations, and low latency permits them to scale out to tens of thousands of endpoints.
200 Gbps

Oh wow, that’s pretty bad.

That's 200Gbps from that card to any other point in the other 9,408 nodes in the system. Including file storage.

Within the node, bandwidth between the GPUs is considerably higher. There's an architecture diagram at <https://docs.olcf.ornl.gov/systems/frontier_user_guide.html> that helps show the topology.

I see, OK, I misinterpreted it as per node bandwidth. Yes, this makes more sense, and is probably fast enough for most workloads.