by msteffen
650 days ago
My former employer (Pachyderm) was acquired by HPE, who built Frontier (and sells supercomputers in general), and I’ve learned a lot about that area since the acquisition. One of the main differences between a supercomputer and, e.g., a datacenter is that on a supercomputer, application authors do not, as a rule, assume hardware or network issues and engineer around them. A typical supercomputer workload will fail overall if any one of its hundreds or thousands of workers fails. This assumption greatly simplifies the work of writing such software, as error handling is typically one of the biggest, if not the biggest, sources of complexity in a distributed system. It makes engineering the hardware much harder, of course, but that’s how HPE makes money. A second difference is that RDMA is standard (Remote Direct Memory Access: the ability for one computer to access another computer’s memory without going through its CPU, because the network card can access memory directly). This removes all the complexity of an RPC framework from supercomputer workloads. Also, the L1 protocol used has orders of magnitude lower latency than Ethernet, such that it’s often faster to read memory on a remote machine than to do any kind of local caching. The result is that the frameworks for writing these workloads let you more or less call an arbitrary function, run it on a neighbor, and collect the result in roughly the same amount of time it would’ve taken to run it locally.
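To make the "any one worker fails, the whole job fails" model concrete, here is a minimal Python sketch. It is only an illustration: threads stand in for nodes, and `run_fail_fast_job` and `worker` are hypothetical names, not part of any HPE or HPC framework. The point is what is missing: no retries, no timeouts, no partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fail_fast_job(worker, n_workers):
    """Run worker(rank) for every rank; any single failure fails the whole job.

    Threads are a stand-in for nodes here (an assumption for illustration).
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(worker, rank) for rank in range(n_workers)]
        # result() re-raises the exception of a failed worker, so the first
        # failure encountered propagates out and no partial result is returned.
        return [f.result() for f in futures]


def square(rank):
    # Hypothetical per-worker computation.
    return rank * rank


def flaky(rank):
    # Hypothetical worker that simulates a hardware fault on rank 3.
    if rank == 3:
        raise RuntimeError("simulated hardware fault on rank 3")
    return rank * rank
```

Calling `run_fail_fast_job(square, 8)` returns all eight results, while `run_fail_fast_job(flaky, 8)` raises `RuntimeError` for the job as a whole. The application code stays this simple precisely because it assumes the hardware does not fail.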
|
HPC applications were driving software checkpointing. If a job runs for days, it's not all that unlikely that one of hundreds of machines fails. At the same time, re-running a large job is fairly costly on such a system.
Now, while that exists, I don't know how commonly it is actually used. In my own, very limited, experience, it wasn't, and job failures due to hardware failure were rare. But then, the cluster(s) I tended to were much smaller, up to some 100 nodes each.
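For anyone unfamiliar with the idea, application-level checkpointing can be sketched roughly like this in Python. This is a toy, single-process illustration under my own assumptions (the names `save_checkpoint`, `load_checkpoint` and `run_job`, the pickle format, and the `fail_at` knob for simulating a crash are all made up), not how any particular HPC checkpointing library works.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over path atomically."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on the same filesystem, so a crash mid-write
    # never leaves a half-written checkpoint at path.
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, n_steps, checkpoint_every=10, fail_at=None):
    """Sum 0..n_steps-1, resuming from the last checkpoint if there is one."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

If the job dies at step 25 with `checkpoint_every=10`, a rerun picks up from step 20 and only the last five steps of work are lost. The trade-off is the usual one: checkpointing more often costs more I/O, checkpointing less often costs more recomputation after a failure.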