by cubefox 651 days ago
> With its nearly 38,000 GPUs, Frontier occupies a unique public-sector role in the field of AI research, which is otherwise dominated by industry.

Is it really realistic to assume that this is the "fastest supercomputer"? What are estimated sizes for supercomputers used by OpenAI, Microsoft, Google etc?

Strangely enough, the Nature piece only mentions possible secret military supercomputers, but not ones used by AI companies.

3 comments

There's a pretty big difference between the workloads that these supercomputers run and those used to train large LLMs (to be clear, hyperscalers also often have "supercomputers" for rent that look more like the DoE laboratory machines).

AI models are trained using one of {Data parallelism, tensor parallelism, pipeline parallelism}. These all have fairly regular access patterns, and want bandwidth.
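As a rough illustration of why data parallelism has such a regular communication pattern, here is a minimal pure-Python sketch. Everything in it (the toy linear model, `local_gradient`, `all_reduce_mean`, the four "workers") is a hypothetical stand-in, not how any real training framework is implemented. Each worker computes a gradient on its own shard, then a single all-reduce averages them.

```python
# Toy data parallelism: every "worker" holds the same model (w, b)
# and an equal-sized shard of the batch. All names are illustrative.

def local_gradient(w, b, shard):
    """Gradient of mean squared error for y ~ w*x + b on one shard."""
    n = len(shard)
    gw = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    gb = sum(2 * (w * x + b - y) for x, y in shard) / n
    return gw, gb

def all_reduce_mean(grads):
    """Stand-in for an all-reduce: average each component across workers."""
    k = len(grads)
    return tuple(sum(g[i] for g in grads) / k for i in range(len(grads[0])))

batch = [(float(x), 3.0 * x + 1.0) for x in range(8)]
num_workers = 4
shard_size = len(batch) // num_workers
shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]

w, b = 0.0, 0.0
per_worker = [local_gradient(w, b, s) for s in shards]  # independent compute
synced = all_reduce_mean(per_worker)                    # one collective per step
full_batch = local_gradient(w, b, batch)                # single-machine reference
```

With equal shard sizes, the averaged gradient matches the full-batch gradient, so the only communication per step is one fixed-size collective. That is the "regular access pattern, wants bandwidth" behavior described above.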

Traditional supercomputer loads {Typically MPI or SHMEM} are often far more variable in access pattern, and synchronization is often incredibly carefully optimized. Bandwidth is still hugely important here, but insane network switches and topologies tend to be the real secret sauce.

More and more, these machines are built using commodity hardware (instead of stuff like Intel's Knights Landing), but the switches and network topology are still often pretty bespoke. This is required for really fine-tuned algorithms like distributed LU factorization, or matrix multiplication algorithms like COSMOS. The hyperscalers, by contrast, often want insane amounts of commodity hardware, network switches included.

The AI supercomputers you're citing are getting a lot closer, but they are definitely more disaggregated than DoE lab machines by nature of the software they run.

Where can you learn more about supercomputing?
There is a difference between a supercomputer and just a large cluster of compute nodes: mainly this is in the bandwidth between the nodes. I suspect industry uses a larger number of smaller groups of highly-connected GPUs for AI work.
Do you mean this supercomputer has slower internode links? What are its links? For example, xAI just brought up a 100k-GPU cluster, most likely with 800Gbps internode links, or maybe even double that.

I think the main difference is in the target numerical precision: supercomputers such as this one focus on maximizing FP64 throughput, while GPU clusters used by OpenAI or xAI want to compute in 16 or even 8 bit precision (BF16 or FP8).
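To get a feel for the precision gap, here is a small sketch using only Python's standard `struct` module. The helper names are made up for illustration, and the BF16 conversion truncates instead of rounding to nearest (which real hardware typically does), so treat it as an approximation.

```python
import struct

def round_to_fp16(x):
    """Round a Python float (FP64) to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def truncate_to_bf16(x):
    """Approximate BF16 by keeping the top 16 bits of the FP32 encoding.
    Assumption: truncation instead of round-to-nearest, for simplicity."""
    fp32_bytes = struct.pack(">f", x)
    return struct.unpack(">f", fp32_bytes[:2] + b"\x00\x00")[0]

x = 0.1
fp16_err = abs(round_to_fp16(x) - x)     # FP16 has 10 stored mantissa bits
bf16_err = abs(truncate_to_bf16(x) - x)  # BF16 has only 7 stored mantissa bits

print(f"FP16 error: {fp16_err:.2e}, BF16 error: {bf16_err:.2e}")
```

BF16 gives up mantissa bits to keep the FP32 exponent range, which tends to suit ML training, while many simulation codes need the full FP64 mantissa.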

It's not just about the link speeds, it's about the topologies used.

Google-style infrastructure uses aggregation trees. This works well for fan-out/fan-in communication patterns, but has limited bisection bandwidth at the core/top of the tree. This can be mitigated with Clos networks / fat trees, but in practice no one goes for full bisection bandwidth on these systems, as the cost and complexity aren't justified.

HPC machines typically use torus topology variants. This allows 2d and 3d grid style computations to be directly mapped onto the system with nearly full bisection bandwidth. Each smallest grid element can communicate directly with its neighbors each iteration, without going over intermediate switches.
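A minimal sketch of the neighbor mapping described above, assuming a hypothetical 4x4x4 torus. The function name and dimensions are illustrative only and do not describe any particular machine.

```python
def torus_neighbors(coord, dims):
    """Return the direct neighbors of `coord` in a torus of shape `dims`.
    Each axis wraps around, so edge nodes have as many neighbors as
    interior ones (two per dimension)."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size  # wraparound link
            neighbors.append(tuple(n))
    return neighbors

dims = (4, 4, 4)  # hypothetical 64-node 3D torus
corner = torus_neighbors((0, 0, 0), dims)
print(corner)
```

A 3D stencil computation whose grid is partitioned the same way only ever talks to these six neighbors, which is why such workloads map so naturally onto this topology.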

Reliability is handled quite a bit differently too. Google-style infrastructure does this with elaborations of the MapReduce style: spot the stragglers or failures, then reallocate that work via software. HPC infrastructure puts more emphasis on hardware reliability.

You're right that F32 and F64 performance are more important on HPC, while Google apps are mostly integer only, and ML apps can use lower precision formats like F16.

Almost no modern systems run a torus these days, at least not at the node level. The backbone links are still occasionally designed that way, although Dragonfly+ or similar is much more common and maps better onto modern switch silicon.

You're spot on that the bandwidth available in these machines hugely outstrips that in common cloud cluster rack-scale designs, although full bisection bandwidth hasn't been a design goal for larger systems for a number of years.

LambdaLabs' GPU cluster provides internode bandwidth of 3.2Tbps: I personally verified it in a cluster of 64 nodes (8xH100 servers), and they claim it holds for clusters of up to 5k GPUs. What is the internode bandwidth of Frontier? Someone claimed it's 200Gbps, which, if true, would be a huge bottleneck for some ML models.
Frontier is 4x 200Gbps links per node into the interconnect. The interconnect is designed for 540TB/s of bisection bandwidth. <https://icl.utk.edu/files/publications/2022/icl-utk-1570-202...>

Bisection bandwidth is the metric these systems will cite, and it impacts how the largest simulations will behave. Inter-node bandwidth isn't a direct comparison, and can be higher at modest node counts as long as you're within a single switch. I haven't seen a network diagram for LambdaLabs, but it looks like they're building off 200Gbps InfiniBand once you get outside of NVLink. So they'll have higher bandwidth within each NVLink island, but the performance will drop once you need to cross islands.

It's 200 Gbps per port, per direction. That's the same as the Nvidia interconnect LambdaLabs uses.
Each node has 4 GPUs, and each of those has a dedicated network interface card capable of 200 Gbps each way. Data can move right from one GPU's memory to another. But it's not just bandwidth that allows the machine to run so well, it's a very low-latency network as well. Many science codes require very frequent synchronizations, and low latency permits them to scale out to tens of thousands of endpoints.
> 200 Gbps

Oh wow, that’s pretty bad.

That's 200Gbps from that card to any other point in the other 9,408 nodes in the system. Including file storage.

Within the node, bandwidth between the GPUs is considerably higher. There's an architecture diagram at <https://docs.olcf.ornl.gov/systems/frontier_user_guide.html> that helps show the topology.

I see, OK, I misinterpreted it as per node bandwidth. Yes, this makes more sense, and is probably fast enough for most workloads.
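For reference, the per-NIC versus per-node arithmetic from this subthread can be sketched as follows. All figures are taken at face value from the comments above (they are not independently verified here), and the variable names are just for illustration.

```python
# Frontier, per the comments above: 4 links per node, 200 Gbps each, per direction.
GBPS_PER_NIC = 200
NICS_PER_NODE = 4

frontier_node_gbps = GBPS_PER_NIC * NICS_PER_NODE    # 800 Gbps per node
frontier_node_gbytes = frontier_node_gbps / 8        # 100 GB/s per node

# LambdaLabs, as reported above: 3.2 Tbps per node of 8 GPUs.
lambda_node_gbps = 3200
lambda_gpus_per_node = 8
lambda_per_gpu_gbps = lambda_node_gbps / lambda_gpus_per_node  # 400 Gbps per GPU

print(frontier_node_gbps, frontier_node_gbytes, lambda_per_gpu_gbps)
```

So the 200 Gbps figure is one quarter of a Frontier node's injection bandwidth, not the whole of it, which is the misreading cleared up above.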
Microsoft has a system at the current #3 spot on the Top500 list. It uses 14.4k Nvidia H100s and got about 1/2 the flops of Frontier.

It’s the fastest publicly disclosed. As far as private concerns, I feel like a “prove it” approach is valid.

https://www.top500.org/lists/top500/2024/06/

This is interesting for a different reason too. MS has 1/4 the number of nodes, while claiming 1/2 the performance. If it were just a numbers game, the MS supercomputer gets much more performance per processor.
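The ratio argument can be checked with quick arithmetic. This sketch only uses the approximate figures quoted in this thread ("nearly 38,000 GPUs", "14.4k" H100s, "about 1/2 the flops", "1/4 the number of nodes"), so the results are rough by construction.

```python
frontier_gpus = 38_000   # "nearly 38,000 GPUs" (quoted at the top of the thread)
ms_gpus = 14_400         # "14.4k" H100s
flops_ratio = 0.5        # MS system at "about 1/2 the flops of Frontier"
node_ratio = 0.25        # "1/4 the number of nodes" (the commenter's figure)

gpu_ratio = ms_gpus / frontier_gpus              # about 0.38
per_gpu_advantage = flops_ratio / gpu_ratio      # about 1.32x per GPU
per_node_advantage = flops_ratio / node_ratio    # 2.0x per node

print(f"{per_gpu_advantage:.2f}x per GPU, {per_node_advantage:.1f}x per node")
```

On these numbers the MS system delivers roughly a third more benchmark flops per GPU and about twice as much per node.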