by londons_explore
1540 days ago
In a fleet of 100,000 machines, there will always be some clear failures... When a machine has 2x the segfaults of any other machine in the fleet, you send it for repairs and someone replaces the motherboard, RAM and CPU... easy! But the painful ones are the 'subtle' failures. Why does machine PABL12 sometimes give NaN as a result while the other 99,999 machines return sensible numbers? Yet all the burn-in hardware tests pass... The solution was to simply exclude any machines that were outliers: any machine in the top or bottom 0.01% for any metric was excluded from future workloads. Sure, in most cases there was nothing wrong with the hardware, but when you're spending hours debugging some fault caused by a sometimes-bad floating point unit on one core of one machine out of 100,000, you're just wasting your time. By auto-banning outliers, the machine ends up doing some other task where data consistency matters less.
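For anyone wondering what "auto-ban the top and bottom 0.01%" might look like, here is a rough sketch. It is not the actual tooling, just an illustration under my own assumptions: the function name `find_outliers`, the `metrics` layout (metric name, then machine name, then value), and the choice to ban any machine reporting NaN outright are all made up for the example.

```python
import math
from typing import Dict, Set


def find_outliers(metrics: Dict[str, Dict[str, float]],
                  tail_fraction: float = 0.0001) -> Set[str]:
    """Return machines in the top or bottom tail of any metric.

    metrics maps metric name -> {machine name -> value} (hypothetical layout).
    tail_fraction=0.0001 corresponds to the 0.01% tails mentioned above.
    """
    banned: Set[str] = set()
    for per_machine in metrics.values():
        # Assumption: a NaN reading is suspicious on its own, so ban it
        # outright and keep it out of the ranking (NaN does not sort sanely).
        nan_machines = {m for m, v in per_machine.items() if math.isnan(v)}
        banned |= nan_machines
        clean = {m: v for m, v in per_machine.items() if m not in nan_machines}

        # Rank machines by value; always take at least one from each end.
        ranked = sorted(clean, key=clean.get)
        k = max(1, round(len(ranked) * tail_fraction))
        banned.update(ranked[:k])   # bottom tail
        banned.update(ranked[-k:])  # top tail
    return banned


if __name__ == "__main__":
    # Tiny made-up fleet; tail_fraction is inflated so something gets banned.
    fleet = {
        "segfaults": {f"m{i}": float(i) for i in range(10)},
        "result_mean": {**{f"m{i}": 1.0 + i * 0.01 for i in range(10)},
                        "m4": float("nan")},
    }
    print(sorted(find_outliers(fleet, tail_fraction=0.1)))
```

Note that this bans machines even when every value is perfectly healthy, since something always sits in the tails. That matches the trade-off described above: a few good machines get excluded so nobody burns hours chasing one flaky core.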
|
It was an annoying struggle trying to raise the visibility of broken CPUs during my years at Google SRE. The SRE org and the rest of the software side of Tech Infra resisted the whole concept, even though it was well-known among platforms hardware eng. The process for taking a known-bad machine out of service involved 1) the machine being reported independently by three different teams; 2) the machine continuing to be in service for days or weeks, at the leisure of some very asynchronous automation; and 3) the machine being returned immediately to service because it passed all of the cursory checks during reinstall. Really irritating. Consequently every major service had to maintain its own private blacklist.
It's nice to see that some influential people on the software side are starting to come around, with papers like "Cores That Don't Count" etc., but man, they could have been on this boat a decade ago.