|
I've had actual, real-life deployments in datacentres where we just left dead hardware in the racks until we needed the space, and we rarely did. Typically we'd visit a couple of times a year, because it was cheap to do so, but it'd have totally viable to let failures accumulate over a much longer time horizon. Failure rates tend to follow a bathtub curve, so if you burn-in the hardware before launch, you'd expect low failure rates for a long period and it's quite likely it'd be cheaper to not replace components and just ensure enough redundancy for key systems (power, cooling, networking) that you could just shut down and disable any dead servers, and then replace the whole unit when enough parts have failed. |
Side: Thanks for sharing about the "bathtub curve", as TIL and I'm surprised I haven't heard of this before especially as it's related to reliability engineering (as from searching on HN (Algolia) that no HN post about the bathtub curve crossed 9 points).