by iJohnDoe 647 days ago
The analogies used in this article were a bit weird.

Two things I’ve always wondered since I’m not an expert.

1. Obviously, applications must be written to distribute the load effectively across the supercomputer. I wonder how often this prevents useful things from even being considered for a supercomputer.

2. It always seems like getting access to run anything on a supercomputer is very competitive, or even artificially limited. It’s a shame this isn’t open to more people. That much processing power seems like it should be put to use for many more things.

2 comments

My former employer (Pachyderm) was acquired by HPE, who built Frontier (and sells supercomputers in general), and I’ve learned a lot about that area since the acquisition.

One of the main differences between supercomputers and, e.g., a datacenter is that in the former case, application authors do not, as a rule, assume hardware or network issues and engineer around them. A typical supercomputer workload will fail overall if any one of its hundreds or thousands of workers fails. This assumption greatly simplifies the work of writing such software, as error handling is typically one of the biggest, if not the biggest, sources of complexity in a distributed system. It makes engineering the hardware much harder, of course, but that’s how HPE makes money.

A second difference is that RDMA is standard. (Remote Direct Memory Access is the ability for one computer to access another computer’s memory without going through its CPU; the network card accesses memory directly.) This removes all the complexity of an RPC framework from supercomputer workloads. Also, the L1 protocol used has orders of magnitude lower latency than Ethernet, such that it’s often faster to read memory on a remote machine than to do any kind of local caching.

The result is that the frameworks for writing these workloads let you more or less call an arbitrary function, run it on a neighbor, and collect the result in roughly the same amount of time it would’ve taken to run it locally.

> A typical supercomputer workload will fail overall if any one of its hundreds or thousands of workers fail.

HPC applications were what drove software checkpointing. If a job runs for days, it's not all that unlikely that one of hundreds of machines fails. At the same time, re-running a large job is fairly costly on such a system.
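To make the idea concrete, here is a minimal sketch of application-level checkpoint/restart in Python. It is only an illustration: the names `save_checkpoint`, `load_checkpoint`, and `run_job`, and the `fail_at` knob that simulates a node failure, are all made up for this example and do not come from any particular HPC checkpointing library.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, checkpoint_every=100, fail_at=None):
    """Sum 0..total_steps-1, resuming from the last checkpoint if one is present.

    fail_at is a hypothetical knob that simulates a hardware failure at that step.
    """
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

If the first run dies at step 250, the second call to `run_job` with the same path picks up from the checkpoint at step 200 rather than from zero, so only 50 steps of work are repeated.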

Now, while that exists, I don't know how commonly it is actually used. In my own, very limited, experience, it wasn't, and job failures due to hardware failure were rare. But then, the cluster(s) I tended were much smaller, up to some 100 nodes each.
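A rough back-of-the-envelope sketch of why cluster size matters here, assuming nodes fail independently with a constant failure rate. The function name `p_job_hit` and the 5-year per-node MTBF are assumptions picked purely for illustration, not measured figures for any real machine.

```python
import math


def p_job_hit(nodes, job_hours, node_mtbf_hours):
    """Probability that at least one node fails during the job.

    Assumes independent nodes with exponentially distributed time to failure.
    """
    p_node_survives = math.exp(-job_hours / node_mtbf_hours)
    return 1.0 - p_node_survives ** nodes


ASSUMED_MTBF_HOURS = 5 * 365 * 24  # hypothetical: one failure per node every 5 years

for n in (100, 1000, 9000):
    print(n, "nodes:", round(p_job_hit(n, 72, ASSUMED_MTBF_HOURS), 3))
```

Under these assumed numbers, a three-day job on 100 nodes gets hit by a failure around 15% of the time, while on a hypothetical 9,000-node machine a failure during the run is close to certain. That would be consistent with failures feeling rare on a small cluster and checkpointing mattering much more at scale.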

I wouldn’t be surprised if the nice guarantees given by scientific supercomputers came from the time when mainframes were the only game in town for scientific computing.

I feel like the name "supercomputer" is overhyped. It's just many normal x86 machines running Linux and connected by a fast network.

Here in Finland I think you can use the LUMI supercomputer for free, on the condition that the results be made publicly available.

I think you've used the "just" trap to trivialize something.

I'm surprised that Frontier is free with the same conditions; I expected researchers to need grant money or whatever to fund their time. Neat.

In the beginning they were just “Beowulf clusters” compared to “real” supercomputers. Isn’t it always like this, the romantic and exceptional is absorbed by the sheer scale of the practical and common once someone discovers a way to drive the economy at scale? Cars, aircraft, long-distance communications, now perhaps AI? Yet the words may still capture the early romance.

FYI: LUMI uses nearly the same architecture as Frontier (AMD CPUs and GPUs), and was also made by HPE.