|
|
|
|
|
by ipsum2
1241 days ago
|
|
Note that this doesn't take into account the numerous iterations required to dial in the correct hyperparameters and model architecture, which could easily increase cost 5-10x. > 256 A100 throughput was extrapolated using the other throughput measurements Is it an indictment of their service that they couldn't afford 256 GPUs on their own cloud? |
|
It turns out that, in these kinds of large-scale experiments, hardware failures are a constant fact of life, and we have tools to manage these hardware failures and allow runs to continue anyway.
Unfortunately, it would mess up our throughput calculations for getting clean baselines here, so we're waiting for our cloud provider to kindly replace the bad A100. Expect those numbers in the next day or so.