by sgeisenh
731 days ago
I did some of the work in the post (though mostly post-setup). Speaking in generalities: the initial failure rates of these units are much higher than those of traditional non-GPU machines. Failure rates typically decline significantly over the operating life of the hardware, so you deal with a bunch of issues up front and try to resolve them to reach a much more stable state. There was a recent Meta engineering blog post that echoed some of our own experiences wrangling GPUs and high-performance networks: https://engineering.fb.com/2024/06/12/data-infrastructure/tr...
|
It's the other stuff I was more surprised about. I would have guessed that having your Ethernet cables plugged in and power supplies tested was table stakes nowadays. Then again, I've never been a datacenter admin...