
There are several different aspects which make recovery hard. HPC tries to push the edge of what's possible with hardware. It does this by throwing redundancy out the window.

First, the simulation can be set up to match the hardware. One simulation program I used expected that the nodes would be set up in a ring, so that messages between i and (i+1)%N were cheap. It ran on hardware with two network ports, one forwards and one backwards in the ring. In fact, the only way to talk between non-neighbors was to forward messages through the neighbors.

If a node goes down, then the ring is broken, and the entire system goes down.
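A rough sketch of that ring constraint, in Python. The function names are made up for illustration, and it assumes the simulation runs in lock-step, so that any node waiting on a dead neighbor stalls everyone else:

```python
def ring_neighbors(i, n):
    """The two nodes that node i is wired to on a ring of n nodes."""
    return ((i - 1) % n, (i + 1) % n)


def ring_hops(src, dst, n):
    """Fewest hops from src to dst when messages can only pass neighbor to neighbor."""
    forward = (dst - src) % n
    return min(forward, n - forward)


def blocked_nodes(n, dead):
    """Live nodes that cannot finish a neighbor exchange because a ring neighbor is dead."""
    return sorted(
        i
        for i in range(n)
        if i not in dead and any(nb in dead for nb in ring_neighbors(i, n))
    )


if __name__ == "__main__":
    print(ring_neighbors(0, 8))   # node 0 talks to nodes 7 and 1
    print(ring_hops(0, 5, 8))     # going backwards around the ring is shorter
    print(blocked_nodes(8, {3}))  # neighbors of the dead node are stuck
```

With one dead node, its two neighbors never get the data they are waiting for, and under the lock-step assumption the rest of the ring ends up waiting on them.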

This is very different than a cluster with point-to-point communications, where a router can redirect a message to a backup node should one of the main nodes go down.

The reason for this architecture is that there's a lot of inter-node traffic. When I was working on this topic back in the 1990s, we were network limited until we switched to fiber optic/ATM. When you read about HPC you'll hear a lot about high-speed interconnects, and using DMA-based communication instead of TCP for higher performance. All of this is to reduce network contention.

Suppose there's 1GB/s of network traffic for each node. (High-end clusters use InfiniBand to get this performance.) In order to have a backup handy, all of that data for each node needs to be replicated somewhere. That's more network traffic. Presumably there are many fewer spare nodes than real nodes, since otherwise that's a lot of expensive hardware that's only rarely used. If there are 512 real nodes and 1 backup node, then that backup node has to handle 512GB/s. Of course, the backup node can die, so you really want to have several backup nodes, each with a huge amount of bandwidth.
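The arithmetic behind those numbers, as a small Python sketch. The function names are hypothetical, and it assumes the replicated traffic is spread evenly over the backup nodes:

```python
import math


def backup_ingest_gb_per_s(real_nodes, per_node_gb_per_s, backup_nodes):
    """Traffic each backup node must absorb if replication is spread evenly."""
    total = real_nodes * per_node_gb_per_s
    return total / backup_nodes


def backups_needed(real_nodes, per_node_gb_per_s, backup_link_gb_per_s):
    """Backup nodes required so that none exceeds its own link capacity."""
    total = real_nodes * per_node_gb_per_s
    return math.ceil(total / backup_link_gb_per_s)


if __name__ == "__main__":
    # 512 real nodes at 1 GB/s each, one backup node.
    print(backup_ingest_gb_per_s(512, 1.0, 1))
    # Same cluster, but backups limited to the same 1 GB/s link as a real node.
    print(backups_needed(512, 1.0, 1.0))
```

If a backup node only has the same 1GB/s link as a real node, you need as many backups as real nodes, which is the expensive, rarely used hardware the design was trying to avoid.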

Even then, the messages only exchange part of the state data. For example, in a spatial decomposition, each node might handle (say) 1,000 cells of a larger grid. The contents of a cell can interact with each other, and with the contents of its neighbor cells, up to some small radius away. (For simplicity, assume a 3D grid where the radius is only one cell, so each cell has 26 neighbors.)

If one node hosts one cell and another node hosts a neighboring cell, then at each step they will have to exchange cell contents in order to compute the interactions. This requires network overhead.

On the other hand, a good spatial decomposition will minimize the amount of network traffic by putting most neighbors on the same machine. After all, memory bandwidth is higher than network, and doesn't have the same contention issues.
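To put a number on how much a good decomposition saves, here is a Python sketch that counts neighbor links for one node. It assumes the node's 1,000 cells form a 10x10x10 cube inside a larger 3D grid; that shape and the function names are illustrative choices, not something fixed by the simulation described here:

```python
from itertools import product

# The 26 offsets to a cell's neighbors at radius one in 3D.
OFFSETS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]


def neighbor_link_counts(side):
    """Count (local, remote) neighbor links for a side x side x side block of cells.

    A link is local when both cells sit in the block (same node, plain memory
    access) and remote when the neighbor lies outside it (needs the network).
    """
    local = remote = 0
    for cell in product(range(side), repeat=3):
        for d in OFFSETS:
            nb = tuple(c + o for c, o in zip(cell, d))
            if all(0 <= x < side for x in nb):
                local += 1
            else:
                remote += 1
    return local, remote


def halo_cells(side):
    """Cells just outside the block whose contents must arrive over the network."""
    return (side + 2) ** 3 - side ** 3


if __name__ == "__main__":
    local, remote = neighbor_link_counts(10)
    print(local, remote, local / (local + remote))
    print(halo_cells(10))
```

For the 10x10x10 case, roughly four out of five neighbor links stay in memory. If the same 1,000 cells were scattered so that no two neighbors shared a node, all 26,000 links would cross the network.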

But this means that the node has mutating state which isn't easily observed by recording and replaying the network. Instead, the backup node needs to get a complete state update of the entire system.

This is a checkpoint. But remember that I used a spatial decomposition to minimize network usage by not sending all of the data all of the time. I've thrown that out of the window. Now I need to checkpoint all of the time, and have the ability to replay the network requests that the node is involved in, should it go down.

This is complicated, and will likely exceed what the hardware can do, given that it's already using high-end hardware for the normal operations.
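A toy Python sketch of the checkpoint-plus-replay idea described above. The classes `ToyNode` and `Backup` are invented for illustration and are nothing like a real HPC checkpointing system; the update rule is arbitrary and assumed to be deterministic:

```python
import copy


class ToyNode:
    """A node whose state changes both locally and from network messages."""

    def __init__(self, cells):
        self.cells = list(cells)
        self.step_count = 0

    def step(self, incoming):
        # Local interaction: each cell is updated from its in-memory neighbor.
        # None of this is visible on the network.
        self.cells = [c + self.cells[i - 1] for i, c in enumerate(self.cells)]
        # Boundary interaction: fold in the value received from another node.
        self.cells[0] += incoming
        self.step_count += 1


class Backup:
    """Holds a full snapshot plus the messages seen since that snapshot."""

    def __init__(self):
        self.snapshot = None
        self.log = []

    def checkpoint(self, node):
        # The whole state has to cross the network, not just boundary data.
        self.snapshot = copy.deepcopy(node)
        self.log.clear()

    def record(self, incoming):
        self.log.append(incoming)

    def recover(self):
        # Restore the snapshot, then replay the logged messages in order.
        node = copy.deepcopy(self.snapshot)
        for msg in self.log:
            node.step(msg)
        return node


if __name__ == "__main__":
    node, backup = ToyNode([1, 2, 3]), Backup()
    backup.checkpoint(node)
    for msg in (10, 20, 30):
        backup.record(msg)
        node.step(msg)
    print(backup.recover().cells == node.cells)
```

The message log on its own is not enough to rebuild the node: the backup also needs the full snapshot, and that snapshot is exactly the data the spatial decomposition was designed to keep off the network.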