| HN Mirror



	by ioedward 1156 days ago
	Nvidia's enterprise GPUs are surprisingly unreliable. Working on a 128 GPU A100 cluster on AWS, 1 would fail every few days. I didn't have any insight on whether it was a hardware or software failure.


csdvrx 1156 days ago

> Working on a 128 GPU A100 cluster on AWS, 1 would fail every few days

Define "fail".

> I didn't have any insight on whether it was a hardware or software failure.

Have scripts check nvidia-smi for ECC errors and dmesg for devices dropping off the PCI bus.

For the former, replace the card. For the latter, just perform a device reset (a power toggle of the device and a rescan of the bus is often enough to be back online within 5 seconds).
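
A rough sketch of what such a check script could look like, in Python. Assumptions on my part, not something tested on that cluster: the `ecc.errors.uncorrected.volatile.total` query field is available on the installed driver (see `nvidia-smi --help-query-gpu`), and the driver logs a line containing "fallen off the bus" when a device drops. The function names are made up for illustration.

```python
import re
import subprocess

# nvidia-smi query; field names as listed by `nvidia-smi --help-query-gpu`.
ECC_QUERY = [
    "nvidia-smi",
    "--query-gpu=index,pci.bus_id,ecc.errors.uncorrected.volatile.total",
    "--format=csv,noheader,nounits",
]

# Assumption: the NVIDIA driver reports a dropped device with this phrase.
BUS_DROP_RE = re.compile(r"fallen off the bus", re.IGNORECASE)


def parse_ecc(csv_text):
    """Return {gpu_index: uncorrected_count} for GPUs reporting more than 0 errors."""
    bad = {}
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3:
            continue
        index, _bus_id, count = fields
        # Non-numeric counts (ECC disabled or unsupported) are skipped.
        if index.isdigit() and count.isdigit() and int(count) > 0:
            bad[int(index)] = int(count)
    return bad


def find_bus_drops(dmesg_text):
    """Return the kernel log lines that look like a GPU dropping off the bus."""
    return [line for line in dmesg_text.splitlines() if BUS_DROP_RE.search(line)]


def run(cmd):
    """Run a command and return its stdout, or '' if it cannot be started."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
    except OSError:
        return ""


if __name__ == "__main__":
    for gpu, count in parse_ecc(run(ECC_QUERY)).items():
        print(f"GPU {gpu}: {count} uncorrected ECC errors -> candidate for replacement")
    for line in find_bus_drops(run(["dmesg"])):
        print(f"bus drop -> candidate for device reset: {line}")
```

Run it from cron or a systemd timer and alert on any output. Reading dmesg may need root depending on the kernel's `dmesg_restrict` setting.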


robotresearcher 1156 days ago

What does an AWS user do with this advice?


csdvrx 1156 days ago

They either figure out how to write scripts or ask AWS support how to get that done.
