by sgeisenh 731 days ago
I did some of the work in the post (though mostly post-setup).

Speaking in generalities: the initial failure rates of these units are much higher than those of traditional non-GPU machines.

Failure rates generally decline significantly over the hardware's operating life, so you deal with a bunch of issues up front and work to resolve them to reach a much more stable state.

There was a recent Meta engineering blog post that echoed some of our own experiences wrangling GPUs and high performance networks: https://engineering.fb.com/2024/06/12/data-infrastructure/tr...

1 comment

I have also heard that failure rates on new GPUs are very high (approaching 20% if not burnt in), so that's unsurprising.

It's the other stuff I was more surprised about. I would have guessed that having your Ethernet cables plugged in and power supplies tested was table stakes nowadays. Then again, I've never been a datacenter admin...