by guenthert
644 days ago
> A typical supercomputer workload will fail overall if any one of its hundreds or thousands of workers fail.

HPC applications were what drove software checkpointing. If a job runs for days, it's not all that unlikely that one of hundreds of machines fails. At the same time, re-running a large job is fairly costly on such a system. Now, while checkpointing exists, I don't know how commonly it is actually used. In my own, very limited, experience it wasn't, and job failures due to hardware failure were rare. But then, the cluster(s) I tended to were much smaller, up to some 100 nodes each.