
I have a bit of experience programming for a highly-parallel supercomputer, specifically in my case an IBM BlueGene/Q. In that case, the answer is a lot of message passing (we used Open MPI [0]). Since the nodes are discrete and don't have any shared memory, you end up with something kinda reminiscent of the actor model as popularized by Erlang and co -- but in C for number-crunching performance.

That said, each of the nodes is itself composed of multiple cores with shared memory. So in cases where you really want to grind out performance, you actually end up using message passing to divvy up chunks of work, and then use classic pthreads to parallelize things further, with lower latency.

I forget the exact terminology used, but the parent is right that the interconnect is the "killer feature." To make that message passing fast, there's a lot of crazy topology to keep the number of hops down. The Q had nodes connected in a "torus" configuration to that end [1].

Debugging is a bit of a nightmare, though, since some bugs inevitably only show up once you have a large number of nodes running the algorithm in parallel. And you'll probably be in a mainframe-style time-sharing setup, so you may have to wait hours or more to rerun things.

This applies less to some of the newer supercomputers, which are more or less clusters of GPUs instead of clusters of CPUs. I imagine there's some commonality, but I haven't worked with any of them so I can't really say.

[0] https://www.open-mpi.org/

[1] https://www.scorec.rpi.edu/~shephard/FEP19/notes-2019/Introd...