by causi 1441 days ago
It feels like it's been a long time since supercomputers were interesting. They're just oodles of identical processors connected together like legos. "We can afford more bricks than the next guy" is not exciting. When was the last time we had a "fastest supercomputer" that could do something the second-fastest couldn't also do?
6 comments

Speed is just a measure of how fast it does something, not a measure of what it's capable of doing. I wouldn't expect to divine "what new things can it do" from that number alone, beyond "things we didn't have enough compute time for before, we now do".

Lego-style supercomputers are still very interesting in my eyes though. As scaling the raw compute performance has simplified to a "how many do you want" problem, the technical complexity in the interconnects has remained interesting and innovative, both intra-node and inter-node. You won't really see that in the FLOPS number that makes the headlines, but the interconnect can be the difference between a type of workload being feasible or not. The main push here is how large you can make certain levels of shared memory access, and at what latencies, so you can run larger jobs instead of just more jobs.

There is also a huge amount of work remaining to be done in programming models and consistency.

Well, fundamentally all supercomputers are Turing machines. So "one can do X while the other cannot" doesn't really make sense in that context.

However, the second-fastest (ARM-based Fugaku) absolutely wipes the floor with the fastest in certain tasks due to a difference in interconnect topology. Fugaku furthermore has no GPUs, unlike many other supercomputers, and instead uses a CPU with some vector instructions, leading to a different programming model.

If you are more into specialized hardware, Anton3 is amazing.

> They're just oodles of identical processors connected together like legos.

That's the Cloud, not supercomputing. Supercomputing is all about interconnect.

I also wonder how the software side of things changes in those settings: how do people design programs and algorithms around fast, wide data paths like these?

I have a bit of experience programming for a highly-parallel supercomputer, specifically in my case an IBM BlueGene/Q. In that case, the answer is a lot of message passing (we used Open MPI [0]). Since the nodes are discrete and don't have any shared memory, you end up with something kinda reminiscent of the actor model as popularized by Erlang and co -- but in C for number-crunching performance.

That said, each of the nodes is itself composed of multiple cores with shared memory. So in cases where you really want to grind out performance, you actually end up using message passing to divvy up chunks of work, and then use classic pthreads to parallelize things further, with lower latency.
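
To make the two-level split above concrete, here is a toy sketch in Python rather than C. It is not MPI and not what we actually ran: queues stand in for message passing between nodes, a thread pool stands in for pthreads inside a node, and every name in it (`node`, `sum_of_squares`, `threads_per_node`) is made up for illustration. Python threads also won't give you a real speedup on number crunching; the point is only the shape of the decomposition.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def node(rank, inbox, outbox, threads_per_node=4):
    """One simulated node: receive a chunk, split it across local threads,
    and send back a partial result. All names here are hypothetical."""
    chunk = inbox.get()  # "receive": the only way data reaches this node
    # Ceiling division so every element lands in some piece.
    step = max(1, -(-len(chunk) // threads_per_node))
    pieces = [chunk[i:i + step] for i in range(0, len(chunk), step)]
    # Second level: threads inside the node work on pieces of the chunk.
    with ThreadPoolExecutor(max_workers=threads_per_node) as pool:
        partial = sum(pool.map(lambda p: sum(x * x for x in p), pieces))
    outbox.put((rank, partial))  # "send" the partial result back


def sum_of_squares(data, n_nodes=4):
    """Scatter chunks to simulated nodes, then reduce their partial sums."""
    inboxes = [queue.Queue() for _ in range(n_nodes)]
    outbox = queue.Queue()
    workers = [
        threading.Thread(target=node, args=(rank, inboxes[rank], outbox))
        for rank in range(n_nodes)
    ]
    for worker in workers:
        worker.start()
    # First level: message passing divvies up the work (a strided scatter).
    for rank in range(n_nodes):
        inboxes[rank].put(data[rank::n_nodes])
    # Gather one partial sum per node and combine them.
    total = sum(outbox.get()[1] for _ in range(n_nodes))
    for worker in workers:
        worker.join()
    return total


if __name__ == "__main__":
    print(sum_of_squares(list(range(1000))))
```

In a real setup the scatter and the final sum would be MPI collectives across nodes, and the inner loop would be threads pinned to the cores of each node.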

I forget the exact terminology used, but the parent is right that the interconnect is the "killer feature." To make that message passing fast, there's a lot of crazy topology to keep the number of hops down. The Q had nodes connected in a "torus" configuration to that end [1].
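
A rough way to see why the wraparound links in a torus help: count the worst-case number of hops with and without them. This is only a sketch under simple assumptions (one hop per step along one axis, shortest-path routing), the 8x8x8 grid is an arbitrary size and not the layout of any real machine, and the function names are made up for illustration.

```python
from itertools import product


def torus_hops(a, b, dims):
    """Fewest hops between nodes a and b when each axis wraps around."""
    return sum(
        min(abs(x - y), d - abs(x - y))  # go direct or around the wrap link
        for x, y, d in zip(a, b, dims)
    )


def mesh_hops(a, b):
    """Fewest hops between nodes a and b on a plain grid (no wraparound)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def worst_case_from_origin(dims, hops):
    """Largest hop count from the node at the origin to any other node."""
    origin = (0,) * len(dims)
    return max(hops(origin, n) for n in product(*(range(d) for d in dims)))


dims = (8, 8, 8)  # hypothetical grid size, chosen only for illustration
torus_diameter = worst_case_from_origin(dims, lambda a, b: torus_hops(a, b, dims))
mesh_diameter = worst_case_from_origin(dims, mesh_hops)

if __name__ == "__main__":
    print("torus worst case:", torus_diameter)
    print("mesh worst case:", mesh_diameter)
```

The origin is a corner of the plain grid, so it sees the grid's true worst case, and every node in a torus looks the same, so measuring from the origin is enough in both cases.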

Debugging is a bit of a nightmare, though, since some bugs inevitably only come up once you have a large number of nodes running the algorithm in parallel. But you'll probably be in a mainframe-style time-sharing setup, so you may have to wait hours or more to rerun things.

This applies less to some of the newer supercomputers, which are more or less clusters of GPUs instead of clusters of CPUs. I imagine there's some commonality, but I haven't worked with any of them so I can't really say.

[0] https://www.open-mpi.org/

[1] https://www.scorec.rpi.edu/~shephard/FEP19/notes-2019/Introd...

Building the communication fabric it takes to make those oodles of identical processors exchange and share data quickly, so they don't get bogged down in their own communication overhead, is a profoundly interesting problem, and by "profoundly interesting" I mean "call Richard Feynman in to help you solve it":

https://longnow.org/essays/richard-feynman-connection-machin...

Besides which, at that level the goal is not to go "look at this cool thing we built", it's more like "how do we cheaply and effectively build something that can solve these massive weather/nuclear explosion/human brain/etc. simulation problems we have?" and if ganging together lots of off-the-shelf CPUs/GPUs achieves that goal with less time, effort, and cost than building super-custom, boutique-schmoutique hardware, so be it.

Not sure about exciting, but I'd think the technical challenges, particularly regarding intra-cluster communication, can be interesting to some. There's a lot of money in it, so they'd better do something useful (more useful than running Linpack or calculating digits of Pi), rather than being just showcases.

That said, #1 is about twice as fast as #2, which is about three times as fast as #3. Those gaps are much wider than I would have expected this late in the game.

You can still get the NEC SX series, which is a non-x86, non-ARM vector super. They're pretty nifty. "Fastest" has gone in a different direction tho.