by davidmr 1351 days ago
I’m not sure I understand what you mean. I’ve run HPC clusters for a long time now, and node failures are just a fact of life. If 3 or 4 nodes of your 500-node cluster are down for a few days while you wait for RMA parts to arrive, you haven’t lost much value. That’s under 1% of the nodes, so the cluster is still running at roughly 99% of peak capacity.

You do have a handful of nodes that the cluster can’t function without (scheduler, fileservers, etc.), but you buy spares and 24x7 support contracts for those nodes.

Did I misunderstand your comment?