|
|
|
|
|
by iJohnDoe
647 days ago
|
|
The analogies used in this article were a bit weird. Two things I’ve always wondered since I’m not an expert. 1. Obviously, applications must be written to run effectively to distribute the load across the supercomputer. I wonder how often this prevents useful things from being considered to run on the supercomputer. 2. It always seems like getting access to run anything on the supercomputer is very competitive or even artificially limited? A shame this isn’t open to more people. That much processing resources seems like it should go much further to be utilized for more things. |
|
One of the main differences between supercomputers and eg a datacenter is that in the former case, application authors do not, as a rule, assume hardware or network issues and engineer around them. A typical supercomputer workload will fail overall if any one of its hundreds or thousands of workers fail. This assumption greatly simplifies the work of writing such software, as error handling is typically one of the biggest, if not the biggest, sources of complexity a distributed system. It makes engineering the hardware much harder, of course, but that’s how HPE makes money.
A second difference is that RDMA (Remote Direct Memory Access—the ability for one computer to access another computer’s memory without going through its CPU. The network card can access memory directly) is standard. This removes all the complexity of an RPC framework from supercomputer workloads. Also, the L1 protocol used has orders of magnitude lower latency than Ethernet, such that it’s often faster to read memory on a remote machine than do any kind of local caching.
The result is that the frameworks for writing these workloads let you more or less call an arbitrary function, run it on a neighbor, and collect the result in roughly the same amount of time it would’ve taken to run it locally.