by ipsum2 1241 days ago
Note that this doesn't take into account the numerous iterations required to dial in the correct hyperparameters and model architecture, which could easily increase cost 5-10x.

> 256 A100 throughput was extrapolated using the other throughput measurements

Is it an indictment of their service that they couldn't afford 256 GPUs on their own cloud?


It's an indictment of the A100 node that died on us yesterday, leaving us with 248 GPUs in the particular cluster where we were running the experiments :(

It turns out that hardware failures are a constant fact of life in large-scale experiments like these, and we have tools to manage them so that runs can continue anyway.

Unfortunately, running on fewer GPUs would throw off the throughput calculations we need for clean baselines here, so we're waiting for our cloud provider to kindly replace the bad A100. Expect those numbers in the next day or so.

Getting reliable GPUs is a difficult problem; I empathize. I've lost a decent amount of time and money to a single failing GPU on an AWS cluster.

We've come to accept that it's an impossible problem at this point. Instead, we're getting good at automatically detecting hardware failures and rapidly restarting runs on fewer nodes. We're also exploring batch sizes that are (where possible) divisible by both N and N-1 nodes. Fault-tolerant system design is unfortunately an evergreen topic in CS.
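
To make the "divisible by N and N-1" idea concrete, here is a minimal sketch of how such batch sizes could be enumerated. The function name `compatible_batch_sizes` and the `gpus_per_node` parameter are hypothetical, made up for this illustration and not taken from anyone's actual tooling. The figure of 8 GPUs per node is an assumption inferred from the 256 vs. 248 GPU counts mentioned above.

```python
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def compatible_batch_sizes(
    n_nodes: int, max_batch: int, gpus_per_node: int = 1
) -> list[int]:
    """Global batch sizes that split evenly across the devices of both
    n_nodes and n_nodes - 1 nodes (hypothetical helper for illustration).
    """
    if n_nodes < 2:
        raise ValueError("need at least 2 nodes to tolerate losing one")
    full = n_nodes * gpus_per_node            # devices with every node healthy
    degraded = (n_nodes - 1) * gpus_per_node  # devices after losing one node
    # Any batch size divisible by both device counts is a multiple of their LCM.
    step = lcm(full, degraded)
    return list(range(step, max_batch + 1, step))


# Assumed setup: 32 nodes of 8 GPUs each (256 GPUs, or 248 with one node down).
print(compatible_batch_sizes(32, 16000, gpus_per_node=8))  # [7936, 15872]
```

Because N and N-1 share no common factors, the smallest such batch size grows roughly as N squared, which is presumably why the "where possible" caveat matters at larger node counts.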