by csdvrx
1115 days ago
> Working on a 128 GPU A100 cluster on AWS, 1 would fail every few days

Define "fail".

> I didn't have any insight on whether it was a hardware or software failure.

Have scripts check nvidia-smi for ECC errors and dmesg for devices dropping off the PCI bus. For the former, replace the card. For the latter, just perform a device reset (a power toggle of the device and a rescan of the bus is often enough to be back online within 5 seconds).
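A rough sketch of what such a check could look like, in Python. Assumptions that are mine and not from the comment above: the NVIDIA kernel driver logs the "fallen off the bus" Xid message in dmesg, the GPU sits at PCI function 0, and the reset is approximated by a sysfs remove followed by a bus rescan (a real power toggle depends on the platform). The function names are hypothetical.

```python
import re
from pathlib import Path

# nvidia-smi query that prints one CSV line per GPU: index, PCI bus id, uncorrected ECC count
QUERY = ["nvidia-smi",
         "--query-gpu=index,pci.bus_id,ecc.errors.uncorrected.volatile.total",
         "--format=csv,noheader"]

# Kernel log line emitted by the NVIDIA driver when a GPU stops responding on the bus
XID_OFF_BUS = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\).*fallen off the bus")


def gpus_with_ecc_errors(csv_text):
    """Return (index, bus_id) for every GPU reporting uncorrected ECC errors."""
    bad = []
    for line in csv_text.strip().splitlines():
        index, bus_id, count = [field.strip() for field in line.split(",")]
        # Cards without ECC report "[N/A]"; skip anything that is not a number
        if count.isdigit() and int(count) > 0:
            bad.append((int(index), bus_id))
    return bad


def gpus_off_bus(dmesg_text):
    """Return the PCI addresses named in 'fallen off the bus' messages."""
    return sorted({m.group(1).lower() for m in XID_OFF_BUS.finditer(dmesg_text)})


def sysfs_address(bus_id):
    """Convert an nvidia-smi or dmesg PCI id to the sysfs form, e.g. 0000:10:1c.0."""
    domain, bus, devfn = bus_id.lower().split(":")
    if "." not in devfn:
        devfn += ".0"  # assumption: the GPU is PCI function 0
    return f"{domain[-4:]}:{bus}:{devfn}"


def remove_and_rescan(address, pci_root=Path("/sys/bus/pci")):
    """Drop the device from the bus, then rescan so the kernel rediscovers it (needs root)."""
    (pci_root / "devices" / address / "remove").write_text("1")
    (pci_root / "rescan").write_text("1")
```

To use it, feed `gpus_with_ecc_errors` the output of `subprocess.run(QUERY, capture_output=True, text=True).stdout` and `gpus_off_bus` the output of `dmesg`, then call `remove_and_rescan(sysfs_address(addr))` for each address in the second list. Cards in the first list are candidates for replacement, not reset.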