Hacker News
by londons_explore 1540 days ago
In a fleet of 100,000 machines, there will always be some clear failures... When the machine has 2x the number of segfaults of any other machine in the fleet, you send it for repairs and someone replaces the motherboard, RAM, and CPU... easy!

But the painful ones are the 'subtle' failures. Why does machine PABL12 sometimes give NaN as a result while the other 99,999 machines return sensible numbers? And yet all the burn-in hardware tests pass...

The solution was to simply exclude any machines that were outliers. If a machine is in the top or bottom 0.01% for any metric, exclude it from future workloads.

Sure, in most cases there was nothing wrong with the hardware, but when you're spending hours debugging some fault caused by a sometimes-bad floating point unit on one core of one machine out of 100,000, you're just wasting your time. By auto-banning outliers, the machine will end up doing some other task where data consistency matters less.
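For anyone wanting to try the auto-ban idea, here is a minimal sketch under stated assumptions. The data shape (a dict of metric name to per-machine values), the function name `find_outliers`, and the tie-breaking rule are all hypothetical, not how the fleet described above actually did it. It assumes every value is a finite number.

```python
from typing import Dict, Set


def find_outliers(metrics: Dict[str, Dict[str, float]],
                  fraction: float = 0.0001) -> Set[str]:
    """Return machines in the top or bottom `fraction` of any metric.

    `metrics` maps metric name -> {machine name -> value}. This layout is
    an assumption for illustration only. The default fraction of 0.0001
    is the 0.01% from the comment above.
    """
    banned: Set[str] = set()
    for per_machine in metrics.values():
        # Rank machines by value; ties are broken by machine name so the
        # result is deterministic (an arbitrary choice).
        ranked = sorted(per_machine, key=lambda m: (per_machine[m], m))
        # Number of machines to drop from each tail. Small fleets round
        # down to zero and ban nothing for that metric.
        k = round(len(ranked) * fraction)
        if k == 0:
            continue
        banned.update(ranked[:k])   # bottom tail
        banned.update(ranked[-k:])  # top tail
    return banned
```

With 100,000 machines and the default fraction, this bans 10 machines per tail per metric. Note that it bans by rank, so when many machines share the same value, some perfectly ordinary machines land in a tail. That matches the "in most cases there was nothing wrong with the hardware" trade-off.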

2 comments

Was pabl12 an actual bad machine? Sounds somehow plausible, as if I heard of it before.

It was an annoying struggle trying to raise the visibility of broken CPUs during my years at Google SRE. The SRE org and the rest of the software side of Tech Infra resisted the whole concept, even though it was well-known among platforms hardware eng. The process for taking a known-bad machine out of service involved 1) the machine being reported independently by three different teams; 2) the machine continuing to be in service for days or weeks, at the leisure of some very asynchronous automation; and 3) the machine being returned immediately to service because it passed all of the cursory checks during reinstall. Really irritating. Consequently every major service had to maintain its own private blacklist.

It's nice to see that some influential people on the software side are starting to come around, with papers like "Cores That Don't Count" etc, but man they could have been on this boat a decade ago.

Reminds me of the typical story of someone with a complete damage protection plan and a flaky device. Take it in for repairs, passes all the tests, but they know it's funky, so snap it in half or otherwise completely wreck it right in front of the tech and demand that repair.

Usually teams would consider a machine "bad" if that node in the cluster had elevated errors compared to the rest of the cluster they were running. Unfortunately this doesn't tell hardware teams what actually went wrong.

If one could show that the CPU said 2+2=9, I'm sure they would yank it out right away, but "it returns 500 errors a lot" isn't very debuggable. The only thing they can do is run the diag and return it to service if nothing comes up.
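A rough sketch of what "show that the CPU said 2+2=9" could look like: repeat a deterministic computation whose answer is known from a closed form, and record any round that disagrees. The function name and parameters are hypothetical. This is not a real hardware diagnostic: it only exercises integer arithmetic through the Python interpreter on whatever core the scheduler picks, whereas a real check would target specific execution units and run on each core in turn.

```python
from typing import List, Tuple


def sum_of_squares_check(n: int = 10_000,
                         rounds: int = 100) -> List[Tuple[int, int, int]]:
    """Repeat a known-answer computation and collect any mismatches.

    Each failure is (round index, value computed, value expected), which is
    the kind of concrete evidence a hardware team can act on.
    """
    # Closed form for 0^2 + 1^2 + ... + (n-1)^2.
    expected = (n - 1) * n * (2 * n - 1) // 6
    failures: List[Tuple[int, int, int]] = []
    for r in range(rounds):
        got = sum(i * i for i in range(n))
        if got != expected:
            failures.append((r, got, expected))
    return failures
```

On healthy hardware the returned list is empty. A non-empty list gives a specific wrong answer to attach to the repair ticket, rather than "it returns 500 errors a lot".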

Well that's one of the reasons this is difficult to handle as an organization. The novice says "the machine is broken" and is mistaken. But the expert says the same thing, and is correct. Same with compiler bugs: novices believe the compiler is full of bugs, journeymen believe the compiler is infallible, but the wise return to the knowledge that the compiler is full of bugs. Maybe that company just needs "bad machine readability" or something.

And your last statement is definitely not true. I can recall multiple instances of demonstrable logic errors in which the machine repeatedly returned to service. This includes all of the machines of a certain generation of a certain vendor's CPUs that were found to have latent ALU bugs, 8 years after going into service.

> When the machine has 2x the number of segfaults of any other machine in the fleet, you send it for repairs

At that scale, it's quite likely sent to repair automatically and whoever's on call just gets a notification.