
by guenthert 691 days ago

> A typical supercomputer workload will fail overall if any one of its hundreds or thousands of workers fail.

HPC applications were what drove software checkpointing. If a job runs for days, it's not all that unlikely that one of hundreds of machines fails. At the same time, re-running a large job is fairly costly on such a system.
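To put a rough number on "not all that unlikely", here is a back-of-the-envelope sketch. It assumes node failures are independent and exponentially distributed; the function name and the 5-year per-node MTBF are made-up illustrative values, not measurements from any real cluster.

```python
import math

def p_any_failure(nodes: int, job_hours: float, node_mtbf_hours: float) -> float:
    """Probability that at least one node fails while the job runs.

    Assumes independent, exponentially distributed node failures
    (a simplification; real failures are often correlated).
    """
    expected_failures = nodes * job_hours / node_mtbf_hours
    return 1.0 - math.exp(-expected_failures)

# Hypothetical per-node MTBF of 5 years, expressed in hours.
ASSUMED_NODE_MTBF_HOURS = 5 * 365 * 24

# A 3-day job on 100 nodes versus 500 nodes.
p_small = p_any_failure(100, 72, ASSUMED_NODE_MTBF_HOURS)
p_large = p_any_failure(500, 72, ASSUMED_NODE_MTBF_HOURS)
print(f"100 nodes: {p_small:.0%}, 500 nodes: {p_large:.0%}")
```

Under those assumptions, the chance of losing a multi-day job grows quickly with node count, which is the case where periodic checkpoints start to pay for themselves.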

Now, while that exists, I don't know how commonly it is actually used. In my own, very limited, experience, it wasn't, and job failures due to hardware failure were rare. But then, the cluster(s) I tended to were much smaller, up to some 100 nodes each.