Getting reliable GPUs is a difficult problem, I empathize. I've spent a decent amount of time and money because there was one failing GPU on an AWS cluster.
We've come to accept that it's an impossible problem at this point. Instead, we're getting good at automatically detecting hardware failures and rapidly restarting runs on fewer nodes. We're also exploring batch sizes that are (where possible) divisible by N nodes and N-1 nodes. Fault tolerant system design is unfortunately an evergreen topic in CS.