by ipsum2 1241 days ago
Getting reliable GPUs is a difficult problem; I empathize. I've lost a decent amount of time and money to a single failing GPU on an AWS cluster.
1 comment

We've come to accept that it's an impossible problem at this point. Instead, we're getting good at automatically detecting hardware failures and rapidly restarting runs on fewer nodes. We're also exploring batch sizes that are (where possible) divisible by both N and N-1, so a run that loses one of its N nodes can restart with the batch still split evenly. Fault-tolerant system design is unfortunately an evergreen topic in CS.
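
To make the divisibility idea concrete, here is a rough sketch, not anyone's actual tooling. It assumes "batch size" means the global batch and that each node takes an equal share; the function name and the `max_batch` cap are made up for illustration. Since N and N-1 are consecutive integers they share no common factor, so the smallest batch that works for both is N * (N-1).

```python
from math import lcm  # available in Python 3.9+


def fallback_friendly_batch_sizes(num_nodes: int, max_batch: int) -> list[int]:
    """List global batch sizes <= max_batch that split evenly across
    num_nodes nodes and also across num_nodes - 1 nodes.

    Hypothetical helper: assumes every node gets an equal share of the batch.
    """
    if num_nodes < 2:
        raise ValueError("need at least 2 nodes to survive losing one")
    # Consecutive integers are coprime, so this equals num_nodes * (num_nodes - 1).
    step = lcm(num_nodes, num_nodes - 1)
    return list(range(step, max_batch + 1, step))


if __name__ == "__main__":
    # 8 nodes, falling back to 7: candidates are multiples of 56.
    print(fallback_friendly_batch_sizes(8, 256))  # [56, 112, 168, 224]
```

The step grows quadratically with the node count (64 nodes already needs multiples of 4032), which is probably why this only works "where possible".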