First true exascale supercomputer? (top500.org)
62 points by AliCollins 1441 days ago
15 comments

This is exciting news! What's also exciting is that it's not just C++ that can run on this supercomputer; there is also good (currently unofficial) support for programming those GPUs from Julia, via the AMDGPU.jl library (note: I am the author/maintainer of this library). Some of our users have been able to run AMDGPU.jl's testsuite on the Crusher test system (which is an attached testing system with the same hardware configuration as Frontier), as well as their own domain-specific programs that use AMDGPU.jl.

What's nice about programming GPUs in Julia is that you can write code once and execute it on multiple kinds of GPUs, with excellent performance. The KernelAbstractions.jl library makes this possible for compute kernels by acting as a frontend to AMDGPU.jl, CUDA.jl, and soon Metal.jl and oneAPI.jl, allowing a single piece of code to be portable to AMD, NVIDIA, Intel, and Apple GPUs, and also CPUs. Similarly, the GPUArrays.jl library allows the same behavior for idiomatic array operations, and will automatically dispatch calls to BLAS, FFT, RNG, linear solver, and DNN vendor-provided libraries when appropriate.

I'm personally looking forward to helping researchers get their Julia code up and running on Frontier so that we can push scientific computing to the max!

Library link: <https://github.com/JuliaGPU/AMDGPU.jl>

There's a bit of drama in that there are unofficial reports of two systems in China with higher performance [0]. The arXiv paper listed below talks about a 40 million core system with around double the theoretical performance of Frontier, and there's apparently a second system online with similar performance. I personally suspect that they didn't submit benchmarks to the top500 simply because those don't run well enough on the systems.

[0] https://arxiv.org/pdf/2204.07816.pdf

I heard they won’t submit anymore so as to not draw further scrutiny and possible sanctions onto their suppliers. Not sure if true, but keeping a low profile certainly makes sense given the blows dealt to the more visible vendors in the past few years.
What are the concerns for vendors?
US banned the sale of American HPC components to Chinese supercomputers.

https://news.ycombinator.com/item?id=9349116 (2015, 93 comments)

https://news.ycombinator.com/item?id=26740371 (2021, 151 comments) etc.

They also prevent Chinese supercomputing-related companies from having their chips fabbed in Taiwan.
Let them build those machines, and if they are any good we can steal all their ideas. Turnabout is fair play.
> and relies on gigabit ethernet for data transfer.

This seems surprising to me; I would have expected 10Gb at least, if not something like InfiniBand.

That's a typo, Frontier uses the Slingshot network from Cray/HPE. The table below has the correct information.
seems to be a proprietary interconnect that is “Ethernet compatible”

https://www.hpe.com/us/en/compute/hpc/slingshot-interconnect...

As far as I know, Slingshot uses a layer 1 that is exactly the same as Ethernet and allows layer 2 ethernet packets to enter switches. However, it has several layer 1/2 extensions that let it look more like Infiniband to use cases that need it, including flow control and congestion control.
It's wrong, and quite a funny typo. The interconnect is 100 gigabit.

https://www.olcf.ornl.gov/frontier/#4

> 8,730,112 total cores

This must include the GPUs, otherwise it'd be 136,408 sockets. For a 42U rack with 4P 1U servers (not that that's what's in use, but to give an understandable napkin figure), that'd be 812 racks.
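For anyone who wants to redo the napkin math above, here is a minimal Python sketch. The 64 cores per socket is an assumption (it is the value the 136,408-socket figure implies), and the function name is made up for illustration.

```python
import math

TOTAL_CORES = 8_730_112   # core count quoted from the TOP500 entry
CORES_PER_SOCKET = 64     # assumption: implied by the 136,408-socket figure
SOCKETS_PER_SERVER = 4    # "4P" server
SERVERS_PER_RACK = 42     # a 42U rack filled with 1U servers


def cpu_only_estimate(total_cores=TOTAL_CORES,
                      cores_per_socket=CORES_PER_SOCKET,
                      sockets_per_server=SOCKETS_PER_SERVER,
                      servers_per_rack=SERVERS_PER_RACK):
    """Return (sockets, racks) if every 'core' were a CPU core."""
    sockets = total_cores // cores_per_socket
    sockets_per_rack = sockets_per_server * servers_per_rack
    racks = math.ceil(sockets / sockets_per_rack)  # a partly filled rack still counts
    return sockets, racks


sockets, racks = cpu_only_estimate()
print(f"{sockets:,} sockets, {racks} racks")
```

Changing the assumed cores per socket shifts the result proportionally, so the conclusion (far more racks than a CPU-only reading allows) does not hinge on the exact value.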

Frontier's own page says 74 "cabinets"/racks, and this is just for the compute (and perhaps switching and/or power? storage is elsewhere). Made up of 9408 nodes, with 4 MI250X GPU accelerators each, each of those accelerators being a dual-chip monster with 8x HBM2e memory. From Anandtech[1], we can see the liquid-cooled half-width sleds are dual socket, and packed packed packed.

[1] https://www.anandtech.com/show/17074/amds-instinct-mi250x-re...

Bit hard to guess what a 'core' would be on a GPU. Compute unit / streaming multiprocessor perhaps.
Back in the 2010 timeframe, there were articles about how exascale supercomputers might be impossible. Would be interesting if someone could go back and assess where those predictions were wrong and where they held, and how the architecture changed to get around those true scaling limits.
Power efficiency mostly. The power requirements of an exascale machine with 2010-timeframe hardware would be crazy.
Oak Ridge still consumes 20 megawatts. However, older technology appeared to require a gigawatt.
That really helps me appreciate the power efficiency gains! So much improvement in so short a time...
But unfortunately power efficiency of GPUs seems to be levelling off recently
https://www.govinfo.gov/content/pkg/CHRG-113hhrg81195/pdf/CH...

This was a governmental report on the challenges of Exascale. Contributors included major universities and all the US supercomputing facilities. It wasn’t that they overlooked the possibility of Moore’s law continuing and associated power reduction.

https://www.olcf.ornl.gov/2021/10/18/exascale-computings-fou...

Just found the article which explains the gains. Mostly GPU. A billion processors doing a billion flops each.

I used to be really excited about supercomputers. It's part of why I pursued HPC-related work.

But I think that having no interest in their actual applications has curbed my enthusiasm. I wish I could make a good living in something that interested me more.

I love the applications, but I'm dismayed at the stagnation in programming models used to get the best performance out of modern clusters. This sums up my feelings:

https://www.usenix.org/conference/atc21/presentation/fri-key...

Could work on the supercomputer hardware/toolchain/libraries instead of the applications
The spec sheet mentions they're moving from CUDA powering their prior supercomputer to "HIP" for this one. This is the first I've heard of HIP, does anyone have experience with it? My impression was that GPU programming tended to mean CUDA, which isn't cross platform (as opposed to HIP).

https://developer.amd.com/resources/rocm-learning-center/fun....

HIP is basically CUDA with s/cuda/hip/g.

My experience is that the stack is pretty rough around the edges. But when it works, you (almost) literally find-and-replace, and it pretty much works as advertised. However, just because you can get to a correct code doesn't necessarily mean that code will achieve optimal performance (without further tuning, of course).
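To make the find-and-replace point concrete, here is a rough Python sketch of the literal `s/cuda/hip/g` substitution. The CUDA fragment is a made-up input string and `naive_hipify` is a hypothetical helper; AMD ships real translation tools (hipify-perl, hipify-clang) that do considerably more than this.

```python
# Hypothetical CUDA host-code fragment, used here only as input text.
CUDA_SNIPPET = """\
cudaMalloc(&d_x, n * sizeof(float));
cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaDeviceSynchronize();
cudaFree(d_x);
"""


def naive_hipify(source: str) -> str:
    """Literal, case-sensitive equivalent of sed 's/cuda/hip/g'."""
    return source.replace("cuda", "hip")


hip_snippet = naive_hipify(CUDA_SNIPPET)
print(hip_snippet)
```

The runtime calls in this fragment map one-to-one onto real HIP names (`hipMalloc`, `hipMemcpy`, `hipMemcpyHostToDevice`, `hipDeviceSynchronize`, `hipFree`). Things like header paths and vendor libraries do not follow the pattern so neatly, which is roughly where the "(almost)" comes from.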

Not only tuning the code, but also the bazillion knobs on the GPUs themselves.
That's the upside of a supercomputer though, fixed architecture to target with enough weight that it's worthwhile.
If you have AMD GPUs, then you need to use HIP to run all those CUDA applications.
I remember in the early 2000s trying to convince people to use Linux and being mocked that it was a "toy" or "not professional enough". While at the time I tried to argue that it was more stable, more secure, and better performing than the competition, and even that it was improving continuously, some people still made fun of me. It is a good thing I've been able, for a long time, to see this: https://www.top500.org/statistics/list/ , choose Category:"Operating system family" and click "Submit".
The writing has been on the wall since the early Beowulf success stories hit.

I'm pretty bullish on the long term survival of Linux in some form or other, proprietary OS's not so much.

“Linux doesn’t scale” was a common argument in the late 90s/early 2000s. I used to point them at the top 10 supercomputer list.
> "This HPE Cray EX system is the first US system with a peak performance exceeding one ExaFlop/s."

So, it's not actually the first one? And another one already exists outside the US?

That was an odd qualification. The only thing they mention is that the #2 computer in Japan is theoretically capable of an Exaflop, but hasn’t demonstrated it yet.
We may assume NSA has faster ones, devoted to speech transcription and codebreaking.
Yes. It is assumed that China is downplaying how capable theirs are.
21 MW power! Insane.

Interestingly, the second one is 30 MW.

For reference, Roadrunner, which was the first petascale system in 2008, used 2.35 MW (according to Wikipedia). So this one gives us 1,000 times the performance for 10 times the energy. From a performance/Watt perspective, this is an impressive improvement.
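The comparison above can be checked with a few lines of Python. This is a sketch using only the round numbers from this thread: the 1,000x performance ratio assumes exactly 1 PFlop/s versus 1 EFlop/s, which is an approximation rather than the measured benchmark results.

```python
ROADRUNNER_MW = 2.35   # power figure quoted above (via Wikipedia)
FRONTIER_MW = 21.0     # power figure quoted elsewhere in this thread
PERF_RATIO = 1_000     # assumption: round petascale -> exascale step


def efficiency_gain(perf_ratio, old_mw, new_mw):
    """Return (power_ratio, flops_per_watt_gain) between two machines."""
    power_ratio = new_mw / old_mw
    return power_ratio, perf_ratio / power_ratio


power_ratio, gain = efficiency_gain(PERF_RATIO, ROADRUNNER_MW, FRONTIER_MW)
print(f"power ratio: {power_ratio:.1f}x, flops-per-watt gain: {gain:.0f}x")
```

With these inputs the power ratio comes out a bit under 9x rather than 10x, so the performance-per-watt gain is slightly above 100x.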

EDIT: Wikipedia also says Roadrunner was not considered power-efficient in its day, which led to it being decommissioned after only five years of operation.

That, and the apps teams hated the architecture given very poor tooling support, especially since the writing was on the wall that the future was GPGPU accelerators and the Cell was a dead end. The Roadrunner processors were awesome on paper, but not so much when it came to working with them. Kind of a shame really: there were some really interesting ideas in that processor design.
This is the first I have noticed them reporting power draw. It seems immoral to run it for anything that doesn't help stop global climate catastrophe. (Presumably global thermonuclear war would suffice. But carbon capture afterward would be hard to arrange for.)

Wondering if they measure while benchmarking, or add up max power ratings of the chips.

Did any old mainframe ever burn like that? E.g. the first big USAF missile tracking system, the one that filled four floors of a custom building?

Using the secret skill of "clicking on the links for the other lists" I discovered that the first TOP500 which had a machine report a power draw was the TOP10 in November 1996: https://top500.org/lists/top500/1996/11/

(498 kW for 229 GFlops. 136,317 times more power draw per flops than the current leader on the Green500.)
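Working backwards from those numbers gives the efficiency the comparison implies for the Green500 leader. A small Python sketch, using only the figures quoted in the comment above (the variable names are mine):

```python
POWER_W_1996 = 498_000    # 498 kW, from the November 1996 list entry
GFLOPS_1996 = 229         # performance of that same machine
RATIO_TO_LEADER = 136_317 # "times more power draw per flops", as quoted

gflops_per_watt_1996 = GFLOPS_1996 / POWER_W_1996
implied_leader_gflops_per_watt = gflops_per_watt_1996 * RATIO_TO_LEADER

print(f"1996 machine: {gflops_per_watt_1996 * 1000:.3f} MFlops/W")
print(f"implied Green500 leader: {implied_leader_gflops_per_watt:.2f} GFlops/W")
```

So the quoted ratio corresponds to a leader at a bit under 63 GFlops/W, against less than half a MFlops/W for the 1996 machine.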

Yet that tells me exactly nothing about what I asked.
Are you kidding? You know a single widebody airliner uses more energy than that, right?
Widebody airliners aren't doing much for the climate, either.
While true, I think the point is more about how minuscule 21 MW is when considered in isolation — 0.001 percent of global electricity usage.
Anybody not boggled over the idea of their program burning 21 MW for as long as it runs either can never be impressed with anything, or has no head for numbers. Computation demanding power that could hurl a fully loaded jumbo jet into the sky is not something most of us will ever code.
In other news, only 100,000 of these would match the whole world's power draw.
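The 100,000 figure is just the 0.001 percent share turned upside down. A quick Python check, assuming the 21 MW and 0.001 percent numbers from the comments above (neither is independently verified here):

```python
MACHINE_MW = 21         # power draw quoted in this thread
SHARE_PERCENT = 0.001   # claimed share of global electricity usage

# If one machine is SHARE_PERCENT of the total, this many machines make 100%.
machines_to_match = round(100 / SHARE_PERCENT)

# Total draw implied by the two figures, in terawatts (1 TW = 1,000,000 MW).
implied_world_tw = MACHINE_MW * machines_to_match / 1_000_000

print(f"{machines_to_match:,} machines, implied total of {implied_world_tw:.1f} TW")
```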
To me 50 megawatts is the baseline of what I would expect for a decent cluster.
Perhaps you in fact forgot how to count lower than 50.
I can't even get out of bed for less than 3 peer bonuses!
He is so fast that just counting to ten, he gets that far before he can stop.
Onward to zettaflops around 2037, assuming an order of magnitude every five years. That's been pretty much the case for 60 years.
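The extrapolation is easy to reproduce. A minimal Python sketch, assuming exascale was reached in 2022 (my assumption for the base year) and the five-years-per-order-of-magnitude rate stated above:

```python
def year_reached(target_exponent, base_exponent=18, base_year=2022,
                 years_per_order=5):
    """Year a 10**target_exponent flops machine appears at a fixed growth rate.

    base_year=2022 for 10**18 flops (exascale) is an assumption.
    """
    orders_of_magnitude = target_exponent - base_exponent
    return base_year + orders_of_magnitude * years_per_order


zetta_year = year_reached(21)   # zetta = 10**21
print(zetta_year)
```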
If they truly wanted to solve world problems, they need to allow an AGI company like DeepMind or OpenAI to use it. The people now using it are likely wasting so much money using outdated technologies.
As in Vernor Vinge's "A Fire upon the Deep", where Powers in the Great Beyond transcend existence while regular people are condemned to live out their lives running a trivial program.
It feels like it's been a long time since supercomputers were interesting. They're just oodles of identical processors connected together like legos. "We can afford more bricks than the next guy" is not exciting. When was the last time we had a "fastest supercomputer" that could do something the second-fastest couldn't also do?
Speed is just the measure of how fast it does something, not a measure of what it's capable of doing. I wouldn't expect to divine more information like "what new things can it do" from that number alone, beyond "things we didn't have enough compute time for before, we do now".

Lego-style supercomputers are still very interesting in my eyes though. While the technical complexity of scaling raw compute performance has simplified to a "how many do you want" problem, the technical complexity in the interconnects has remained interesting and innovative, for connectivity both within and between nodes. You won't really see that in the FLOPS number that makes the headlines, but the interconnect can be the difference between a type of workload being feasible or not. The main push here is how far you can extend certain levels of shared memory access, and at what latencies, so you can run larger jobs instead of just more jobs.

There is also a huge amount of work remaining to be done in programming models and consistency.
Well, fundamentally all supercomputers are Turing machines, so "one can do X while the other cannot" doesn't really make sense in that context.

However, the second-fastest (ARM-based Fugaku) absolutely wipes the floor with the fastest in certain tasks due to a difference in interconnect topology. Fugaku furthermore has no GPUs, unlike many other supercomputers, and instead uses a CPU with some vector instructions, leading to a different programming model.

If you are more into specialized hardware, Anton3 is amazing.

> They're just oodles of identical processors connected together like legos.

That's the Cloud, not supercomputing. Supercomputing is all about interconnect.

I also wonder how the software side of things changes in those settings: how do people design programs and algorithms around fast, wide data paths like these?
I have a bit of experience programming for a highly-parallel supercomputer, specifically in my case an IBM BlueGene/Q. In that case, the answer is a lot of message passing (we used Open MPI [0]). Since the nodes are discrete and don't have any shared memory, you end up with something kinda reminiscent of the actor model as popularized by Erlang and co -- but in C for number-crunching performance.

That said, each of the nodes is itself composed of multiple cores with shared memory. So in cases where you really want to grind out performance, you actually end up using message passing to divvy up chunks of work, and then use classic pthreads to parallelize things further, with lower latency.
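To give a feel for that two-level structure, here is a toy Python sketch. It is only a structural analogy: "nodes" are simulated with threads that share nothing except message queues, and each node then fans its chunk out to a small thread pool standing in for its cores. All names are made up for illustration; real code on such a machine would use MPI and pthreads in C, and Python threads will not give you the actual speedup.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def node(rank, inbox, outbox, cores=4):
    """One simulated node: receive a chunk, split it across 'cores', reply."""
    chunk = inbox.get()                                # blocking receive
    pieces = [chunk[i::cores] for i in range(cores)]   # per-core slices
    with ThreadPoolExecutor(max_workers=cores) as pool:
        partial = sum(pool.map(lambda p: sum(x * x for x in p), pieces))
    outbox.put((rank, partial))                        # send partial result back


def sum_of_squares(data, nodes=3):
    """Scatter data to the nodes by message, then gather and reduce."""
    inboxes = [queue.Queue() for _ in range(nodes)]
    results = queue.Queue()
    workers = [threading.Thread(target=node, args=(r, inboxes[r], results))
               for r in range(nodes)]
    for w in workers:
        w.start()
    for r in range(nodes):
        inboxes[r].put(data[r::nodes])                 # scatter one chunk per node
    total = sum(results.get()[1] for _ in range(nodes))  # gather + reduce
    for w in workers:
        w.join()
    return total


print(sum_of_squares(list(range(10))))
```

The point of the shape is that the coordinator never touches a node's memory directly; everything crosses a queue, while the work inside a node shares the chunk freely.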

I forget the exact terminology used, but the parent is right that the interconnect is the "killer feature." To make that message passing fast, there's a lot of crazy topology to keep the number of hops down. The Q had nodes connected in a "torus" configuration to that end [1].
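As a rough illustration of why the wraparound links in a torus keep hop counts down, here is a small Python sketch comparing a plain mesh with a torus of the same shape. The 8x8x8 shape is an arbitrary example, not the actual machine layout, and the function names are hypothetical.

```python
def ring_hops(a, b, size, wrap=True):
    """Hops between positions a and b along one dimension."""
    d = abs(a - b)
    return min(d, size - d) if wrap else d   # wraparound can go the short way


def hops(src, dst, dims, wrap=True):
    """Minimal hop count between two nodes, summed over dimensions."""
    return sum(ring_hops(a, b, n, wrap) for a, b, n in zip(src, dst, dims))


def max_hops(dims, wrap=True):
    """Worst-case hop count (network diameter) for the given shape."""
    return sum((n // 2) if wrap else (n - 1) for n in dims)


dims = (8, 8, 8)                 # arbitrary example shape
corner_a, corner_b = (0, 0, 0), (7, 7, 7)

print("mesh: ", hops(corner_a, corner_b, dims, wrap=False), max_hops(dims, wrap=False))
print("torus:", hops(corner_a, corner_b, dims, wrap=True), max_hops(dims, wrap=True))
```

In this example the opposite corners are 21 hops apart on the mesh but only 3 on the torus, and the worst case drops from 21 to 12. Adding more dimensions shrinks the worst case further for the same node count.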

Debugging is a bit of a nightmare, though, since some bugs inevitably only come up once you have a large number of nodes running the algorithm in parallel. But you'll probably be in a mainframe-style time-sharing setup, so you may have to wait hours or more to rerun things.

This applies less to some of the newer supercomputers, which are more or less clusters of GPUs instead of clusters of CPUs. I imagine there's some commonality, but I haven't worked with any of them so I can't really say.

[0] https://www.open-mpi.org/

[1] https://www.scorec.rpi.edu/~shephard/FEP19/notes-2019/Introd...

Building the communication fabric it takes to make those oodles of identical processors exchange and share data quickly, so they don't get bogged down in their own communication overhead, is a profoundly interesting problem, and by "profoundly interesting" I mean "call Richard Feynman in to help you solve it":

https://longnow.org/essays/richard-feynman-connection-machin...

Besides which, at that level the goal is not to go "look at this cool thing we built", it's more like "how do we cheaply and effectively build something that can solve these massive weather/nuclear explosion/human brain/etc. simulation problems we have?" and if ganging together lots of off-the-shelf CPUs/GPUs achieves that goal with less time, effort, and cost than building super-custom, boutique-schmoutique hardware, so be it.

Not sure about exciting, but I'd think the technical challenges, particularly regarding intra-cluster communication, can be interesting to some. There's a lot of money in it, so they had better do something useful (more useful than running Linpack or calculating digits of Pi), rather than being just showcases.

That said, #1 is about twice as fast as #2, which is about three times as fast as #3. Those gaps are much wider than I would have expected this late in the game.

You can still get the NEC SX series, which is a non-x86, non-arm vector super. They're pretty nifty. "Fastest" has gone in a different direction tho.
For comparison, 2000 SP Power3 375 MHz in Oak Ridge National Laboratory did the same order of magnitude GFlops as iPhones with A14 chip can do.
TL;DR: Wow! ~9 million cores, 21 megawatts, >2x the performance of #2 but pulling less power (compared to 30MW). #3 is 0.15EFLOPS, but also 3MW.