This is exciting news! What's also exciting is that it's not just C++ that can run on this supercomputer; there is also good (currently unofficial) support for programming those GPUs from Julia, via the AMDGPU.jl library (note: I am the author/maintainer of this library). Some of our users have been able to run AMDGPU.jl's test suite on the Crusher test system (an attached testing system with the same hardware configuration as Frontier), as well as their own domain-specific programs that use AMDGPU.jl.
What's nice about programming GPUs in Julia is that you can write code once and execute it on multiple kinds of GPUs, with excellent performance. The KernelAbstractions.jl library makes this possible for compute kernels by acting as a frontend to AMDGPU.jl, CUDA.jl, and soon Metal.jl and oneAPI.jl, allowing a single piece of code to be portable to AMD, NVIDIA, Intel, and Apple GPUs, and also CPUs. Similarly, the GPUArrays.jl library allows the same behavior for idiomatic array operations, and will automatically dispatch calls to BLAS, FFT, RNG, linear solver, and DNN vendor-provided libraries when appropriate.
I'm personally looking forward to helping researchers get their Julia code up and running on Frontier so that we can push scientific computing to the max!
There's a bit of drama in that there are unofficial reports of two systems in China with higher performance [0]. The arXiv paper listed below describes a 40-million-core system with around double the theoretical performance of Frontier, and there's apparently a second system online with similar performance. I personally suspect that they didn't submit benchmarks to the TOP500 simply because those benchmarks don't run well enough on those systems.
I heard they won’t submit anymore so as to not draw further scrutiny and possible sanctions onto their suppliers. Not sure if true, but keeping a low profile certainly makes sense given the blows dealt to the more visible vendors in the past few years.
As far as I know, Slingshot uses a layer 1 that is exactly the same as Ethernet and allows layer 2 Ethernet packets to enter switches. However, it has several layer 1/2 extensions that let it look more like InfiniBand for use cases that need it, including flow control and congestion control.
This must include the GPUs, otherwise it'd be 136,408 sockets. For a 42U rack with 4P 1U servers (not that that's what's in use, but to give an understandable napkin figure), that'd be 812 racks.
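The napkin math above can be checked in a few lines. This is only a sketch of the comment's own hypothetical layout (a 42U rack filled with four-socket 1U servers), not a description of Frontier's actual cabinets:

```python
import math

total_sockets = 136_408   # socket count quoted in the comment above
rack_units = 42           # 42U rack (hypothetical layout from the comment)
sockets_per_server = 4    # 4P, i.e. four sockets per server
server_height_u = 1       # each server occupies 1U

# A full rack holds 42 servers of 4 sockets each.
sockets_per_rack = (rack_units // server_height_u) * sockets_per_server

# Round up, since a partially filled rack is still a rack.
racks_needed = math.ceil(total_sockets / sockets_per_rack)

print(sockets_per_rack)  # 168
print(racks_needed)      # 812
```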
Frontier's own page says 74 "cabinets"/racks, and this is just for the compute (and perhaps switching and/or power? Storage is elsewhere). It's made up of 9,408 nodes with 4 MI250X GPU accelerators each, and those accelerators are dual-chip monsters with 8x HBM2e memory apiece. From Anandtech[1], we can see the liquid-cooled half-width sleds are dual socket, and packed packed packed.
Back in the 2010 timeframe, there were articles about how an exascale supercomputer might be impossible. It would be interesting if someone could go back and assess where those predictions were wrong and where they held, and how the architecture changed to get around the scaling limits that turned out to be real.
This was a governmental report on the challenges of Exascale. Contributors included major universities and all the US supercomputing facilities. It wasn’t that they overlooked the possibility of Moore’s law continuing and associated power reduction.
I used to be really excited about supercomputers. It's part of why I pursued HPC-related work.
But I think that having no interest in their actual applications has curbed my enthusiasm. I wish I could make a good living in something that interested me more.
I love the applications, but I'm dismayed at the stagnation in programming models used to get the best performance out of modern clusters. This sums up my feelings:
The spec sheet mentions they're moving from CUDA powering their prior supercomputer to "HIP" for this one. This is the first I've heard of HIP. Does anyone have experience with it? My impression was that GPU programming tended to mean CUDA, which isn't cross-platform (as opposed to HIP).
My experience is that the stack is pretty rough around the edges. But when it works, you (almost) literally find-and-replace, and it pretty much works as advertised. However, just because you can get to correct code doesn't necessarily mean that code will achieve optimal performance (without further tuning, of course).
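To give a feel for what "find-and-replace" means here, below is a toy sketch in Python. The `toy_hipify` helper is hypothetical and written only for illustration; real ports use AMD's hipify tools, which handle far more cases (kernel launches, library calls, and so on). The CUDA runtime names in the sample and their HIP counterparts are real API names.

```python
import re


def toy_hipify(source: str) -> str:
    """Hypothetical illustration only: rename CUDA runtime identifiers to HIP."""
    # The runtime header lives at a different path in HIP.
    source = source.replace("<cuda_runtime.h>", "<hip/hip_runtime.h>")
    # Rename identifiers such as cudaMalloc -> hipMalloc.
    return re.sub(r"\bcuda([A-Z]\w*)", r"hip\1", source)


cuda_snippet = """#include <cuda_runtime.h>
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
cudaFree(d_x);
"""

hip_snippet = toy_hipify(cuda_snippet)
print(hip_snippet)
```

The renamed calls (`hipMalloc`, `hipMemcpy`, `hipMemcpyHostToDevice`, `hipDeviceSynchronize`, `hipFree`) all exist in HIP, which is why a mechanical rename gets surprisingly far. It says nothing about performance, which is the tuning caveat above.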
I remember in the early 2000s trying to convince people to use Linux and being mocked that it was a "toy" or "not professional enough". At the time I tried to argue that it was more stable, more secure, and better performing than the competition, and even that it was improving continuously, but some people still made fun of me. It is a good thing I've been able, for a long time, to see this: https://www.top500.org/statistics/list/ , choose Category: "Operating system family" and click "Submit".
That was an odd qualification. The only thing they mention is that the #2 computer in Japan is theoretically capable of an Exaflop, but hasn’t demonstrated it yet.
For reference, Roadrunner, which was the first petascale system in 2008, used 2.35 MW (according to Wikipedia). So this one gives us 1,000 times the performance for 10 times the energy. From a performance/Watt perspective, this is an impressive improvement.
EDIT: Wikipedia also says Roadrunner was not considered power-efficient in its day, which led to it being decommissioned after only five years of operation.
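A quick sketch of the performance-per-watt comparison, using round numbers. Assumptions: Roadrunner at a nominal 1 petaflop and 2.35 MW (the figure above), Frontier at a nominal 1 exaflop and roughly 21 MW (the power figure quoted elsewhere in this thread). Measured benchmark results would shift these slightly.

```python
roadrunner_flops = 1e15    # nominal "petascale" (assumed round number)
roadrunner_watts = 2.35e6  # 2.35 MW, per the comment above
frontier_flops = 1e18      # nominal "exascale" (assumed round number)
frontier_watts = 21e6      # roughly 21 MW, as quoted elsewhere in the thread

perf_ratio = frontier_flops / roadrunner_flops    # how much faster
power_ratio = frontier_watts / roadrunner_watts   # how much more power
efficiency_gain = perf_ratio / power_ratio        # gain in flops per watt

print(f"{perf_ratio:.0f}x performance")               # 1000x
print(f"{power_ratio:.1f}x power")                    # 8.9x
print(f"{efficiency_gain:.0f}x performance per watt") # 112x
```

So "10 times the energy" is, if anything, slightly pessimistic, and the efficiency gain works out to a bit over two orders of magnitude under these assumptions.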
That, and the apps teams hated the architecture given very poor tooling support, especially since the writing was on the wall that the future was GPGPU accelerators and the Cell was a dead end. The Roadrunner processors were awesome on paper, but not so much when it came to working with them. Kind of a shame really: there were some really interesting ideas in that processor design.
This is the first I have noticed them reporting power draw. It seems immoral to run it for anything that doesn't help stop global climate catastrophe. (Presumably global thermonuclear war would suffice. But carbon capture afterward would be hard to arrange for.)
Wondering if they measure while benchmarking, or add up max power ratings of the chips.
Did any old mainframe ever burn like that? E.g. the first big USAF missile tracking system, the one that filled four floors of a custom building?
Using the secret skill of "clicking on the links for the other lists" I discovered that the first TOP500 which had a machine report a power draw was the TOP10 in November 1996: https://top500.org/lists/top500/1996/11/
(498 kW for 229 GFlops. 136,317 times more power draw per flops than the current leader on the Green500.)
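For anyone who wants to sanity-check that ratio, here is a small sketch that works backwards from the numbers in the comment to the efficiency they imply for the current Green500 leader. Only the three quoted figures are used as inputs; the variable names are mine.

```python
machine_power_w = 498e3   # 498 kW, from the comment above
machine_gflops = 229.0    # 229 GFlops, from the comment above
quoted_ratio = 136_317    # "times more power draw per flops"

# Efficiency of the 1996 machine in GFlops per watt.
old_efficiency = machine_gflops / machine_power_w

# If the quoted ratio holds, this is the Green500 leader's efficiency.
implied_leader_efficiency = old_efficiency * quoted_ratio

print(f"{old_efficiency * 1000:.2f} MFlops/W")        # 0.46 MFlops/W
print(f"{implied_leader_efficiency:.2f} GFlops/W")    # 62.68 GFlops/W
```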
Anybody not boggled over the idea of their program burning 21 MW for as long as it runs either can never be impressed with anything, or has no head for numbers. Computation demanding power that could hurl a fully loaded jumbo jet into the sky is not something most of us will ever code.
If they truly wanted to solve world problems, they need to allow an AGI company like DeepMind or OpenAI to use it. The people now using it are likely wasting so much money using outdated technologies.
As in Vernor Vinge's "A Fire upon the Deep", where Powers in the Great Beyond transcend existence while regular people are condemned to live out their lives running a trivial program.
It feels like it's been a long time since supercomputers were interesting. They're just oodles of identical processors connected together like legos. "We can afford more bricks than the next guy" is not exciting. When was the last time we had a "fastest supercomputer" that could do something the second-fastest couldn't also do?
Speed is just the measure of how fast it does something, not a measure of what it's capable of doing. I wouldn't expect to divine more information like "what new things can it do" from that number alone, beyond "things we didn't have enough compute time for before, we do now".
Lego-style supercomputers are still very interesting in my eyes, though. As the technical complexity involved in scaling raw compute performance has simplified to a "how many do you want" problem, the technical complexity in the interconnects has remained interesting and innovative, both within a node and between nodes. You won't really see that in the FLOPS number that makes the headlines, but the interconnect can be the difference between a type of workload being feasible or not. The main push here is how far you can extend certain levels of shared memory access, and at what latencies, in order to run larger jobs instead of just more jobs.
Well, fundamentally all supercomputers are Turing machines, so "X can do something that Y cannot" doesn't really make sense in that context.
However, the second-fastest (ARM-based Fugaku) absolutely wipes the floor with the fastest in certain tasks due to a difference in interconnect topology. Furthermore, Fugaku has no GPUs, unlike many other supercomputers, and instead uses CPUs with vector instructions, leading to a different programming model.
If you are more into specialized hardware, Anton3 is amazing.
I also wonder how the software side of things changes in those settings: how do people design programs and algorithms around fast, wide data paths like these?
I have a bit of experience programming for a highly-parallel supercomputer, specifically in my case an IBM BlueGene/Q. In that case, the answer is a lot of message passing (we used Open MPI [0]). Since the nodes are discrete and don't have any shared memory, you end up with something kinda reminiscent of the actor model as popularized by Erlang and co -- but in C for number-crunching performance.
That said, each of the nodes is itself composed of multiple cores with shared memory. So in cases where you really want to grind out performance, you actually end up using message passing to divvy up chunks of work, and then use classic pthreads to parallelize things further, with lower latency.
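Here is a toy, single-machine sketch of that two-level structure, using only Python's standard library. It is not MPI and not what we ran: the "nodes" are simulated with threads that share nothing except message queues, and the inner thread pool stands in for pthreads. All names (`node`, `hybrid_sum_of_squares`, and so on) are made up for illustration, and Python threads won't give real CPU parallelism, so this shows the shape of the program only.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def partial_sum(piece):
    # Inner, shared-memory level: one worker thread handles one piece.
    return sum(x * x for x in piece)


def node(inbox, outbox, threads_per_node=4):
    # Outer level: a "node" only sees what arrives in its inbox.
    chunk = inbox.get()
    step = max(1, len(chunk) // threads_per_node)
    pieces = [chunk[i:i + step] for i in range(0, len(chunk), step)]
    with ThreadPoolExecutor(max_workers=threads_per_node) as pool:
        # Send a single result message back to the coordinator.
        outbox.put(sum(pool.map(partial_sum, pieces)))


def hybrid_sum_of_squares(data, num_nodes=4):
    outbox = queue.Queue()
    inboxes = [queue.Queue() for _ in range(num_nodes)]
    workers = [
        threading.Thread(target=node, args=(inbox, outbox))
        for inbox in inboxes
    ]
    for worker in workers:
        worker.start()
    # "Scatter": send each node its own slice of the data as a message.
    for rank, inbox in enumerate(inboxes):
        inbox.put(data[rank::num_nodes])
    for worker in workers:
        worker.join()
    # "Reduce": combine one result message per node.
    return sum(outbox.get() for _ in range(num_nodes))


total = hybrid_sum_of_squares(list(range(1000)))
print(total == sum(x * x for x in range(1000)))
```

In the real thing, the scatter and reduce steps would be MPI calls between physically separate nodes, and the inner level would be threads sharing one node's memory.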
I forget the exact terminology used, but the parent is right that the interconnect is the "killer feature." To make that message passing fast, there's a lot of crazy topology to keep the number of hops down. The Q had nodes connected in a "torus" configuration to that end [1].
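To see why a torus keeps hop counts down, here is a small sketch comparing the worst-case hop count from one corner of a grid with and without wraparound links. The 8x8x8 layout is a made-up example for illustration, not the actual machine's dimensions.

```python
from itertools import product


def torus_hops(a, b, dims):
    # Each dimension wraps around, so take the shorter way round the ring.
    return sum(min(abs(x - y), n - abs(x - y)) for x, y, n in zip(a, b, dims))


def mesh_hops(a, b):
    # Same grid, but without the wraparound links.
    return sum(abs(x - y) for x, y in zip(a, b))


dims = (8, 8, 8)  # hypothetical 3D layout with 512 nodes
origin = (0, 0, 0)
nodes = list(product(*(range(n) for n in dims)))

worst_torus = max(torus_hops(origin, node, dims) for node in nodes)
worst_mesh = max(mesh_hops(origin, node) for node in nodes)

print(worst_torus, worst_mesh)  # 12 21
```

The wraparound links nearly halve the worst case here, and adding more dimensions shrinks it further for the same node count.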
Debugging is a bit of a nightmare, though, since some bugs inevitably only come up once you have a large number of nodes running the algorithm in parallel. But you'll probably be in a mainframe-style time-sharing setup, so you may have to wait hours or more to rerun things.
This applies less to some of the newer supercomputers, which are more or less clusters of GPUs instead of clusters of CPUs. I imagine there's some commonality, but I haven't worked with any of them so I can't really say.
Building the communication fabric that lets those oodles of identical processors exchange and share data quickly, so they don't get bogged down in their own communication overhead, is a profoundly interesting problem, and by "profoundly interesting" I mean "call Richard Feynman in to help you solve it":
Besides which, at that level the goal is not to go "look at this cool thing we built", it's more like "how do we cheaply and effectively build something that can solve these massive weather/nuclear explosion/human brain/etc. simulation problems we have?" and if ganging together lots of off-the-shelf CPUs/GPUs achieves that goal with less time, effort, and cost than building super-custom, boutique-schmoutique hardware, so be it.
Not sure about exciting, but I'd think the technical challenges, particularly regarding intra-cluster communication, can be interesting to some. There's a lot of money in it, so they'd better do something useful (more useful than running Linpack or calculating digits of Pi), rather than being just showcases.
That said, #1 is about twice as fast as #2, which is about three times as fast as #3. Those gaps are much wider than I would have expected this late in the game.
Library link: <https://github.com/JuliaGPU/AMDGPU.jl>