by jfrankle
1243 days ago
It's an indictment of the A100 node that died on us yesterday, leaving us with 248 GPUs in the particular cluster where we were running the experiments :( It turns out that, in these kinds of large-scale experiments, hardware failures are a constant fact of life, and we have tools to manage them and let runs continue anyway. Unfortunately, running on the reduced cluster would mess up the throughput calculations we need for clean baselines here, so we're waiting for our cloud provider to kindly replace the bad A100. Expect those numbers in the next day or so.