
by csdvrx 1115 days ago

> Working on a 128 GPU A100 cluster on AWS, 1 would fail every few days

Define "fail".

> I didn't have any insight on whether it was a hardware or software failure.

Have scripts check nvidia-smi for ECC errors and dmesg for devices dropping off the PCI bus.

For the former, replace the card. For the latter, just perform a device reset (a power toggle of the device and a rescan of the bus is often enough to bring it back online within 5 seconds).
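A minimal sketch of what such a check could look like, in Python. The function names are made up for illustration. The `ecc.errors.uncorrected.volatile.total` query field and the sysfs `remove`/`rescan` files exist, but the "fallen off the bus" log wording is an assumption about what the NVIDIA driver prints on a given version, so adjust the pattern to your own dmesg output. The remove-and-rescan step is only one way to approximate the reset described above, needs root, and is not a true power toggle.

```python
import re
import subprocess

# nvidia-smi query for the per-GPU count of uncorrected ECC errors since the last driver load.
ECC_QUERY = [
    "nvidia-smi",
    "--query-gpu=index,ecc.errors.uncorrected.volatile.total",
    "--format=csv,noheader,nounits",
]

# Assumed wording of the kernel log line emitted when a GPU drops off the PCI bus.
BUS_DROP = re.compile(r"NVRM:.*fallen off the bus", re.IGNORECASE)


def parse_ecc(csv_text):
    """Return {gpu_index: count} for every GPU reporting a nonzero uncorrected ECC count."""
    bad = {}
    for line in csv_text.strip().splitlines():
        parts = [field.strip() for field in line.split(",")]
        if len(parts) != 2:
            continue
        index, count = parts
        # GPUs without ECC report a non-numeric placeholder such as "[N/A]"; skip those.
        if index.isdigit() and count.isdigit() and int(count) > 0:
            bad[int(index)] = int(count)
    return bad


def find_bus_drops(dmesg_text):
    """Return the kernel log lines that look like a GPU dropping off the bus."""
    return [line for line in dmesg_text.splitlines() if BUS_DROP.search(line)]


def check_node():
    """Run both checks on the local machine (needs nvidia-smi and readable dmesg)."""
    ecc = subprocess.run(ECC_QUERY, capture_output=True, text=True, check=True).stdout
    log = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
    return parse_ecc(ecc), find_bus_drops(log)


def remove_and_rescan(pci_addr):
    """Detach one PCI device and rescan the bus via sysfs (root only).

    pci_addr is a full address such as "0000:00:1e.0" (example value, not from the thread).
    """
    with open(f"/sys/bus/pci/devices/{pci_addr}/remove", "w") as f:
        f.write("1")
    with open("/sys/bus/pci/rescan", "w") as f:
        f.write("1")
```

Run `check_node()` from cron or a health-check daemon: a non-empty first result points at a card to replace, a non-empty second result at a device to reset.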


robotresearcher 1114 days ago

What does an AWS user do with this advice?


csdvrx 1114 days ago

They either figure out how to write scripts or ask AWS support how to get that done.
