by jeffbee 1540 days ago

Was pabl12 an actual bad machine? Sounds somehow plausible, as if I heard of it before.

It was an annoying struggle trying to raise the visibility of broken CPUs during my years at Google SRE. The SRE org and the rest of the software side of Tech Infra resisted the whole concept, even though it was well known among platforms hardware eng. The process for taking a known-bad machine out of service involved 1) the machine being reported independently by three different teams; 2) the machine continuing to be in service for days or weeks, at the leisure of some very asynchronous automation; and 3) the machine being returned immediately to service because it passed all of the cursory checks during reinstall. Really irritating. Consequently, every major service had to maintain its own private blacklist.

It's nice to see that some influential people on the software side are starting to come around, with papers like "Cores That Don't Count," etc., but man, they could have been on this boat a decade ago.

2 comments

mjevans 1540 days ago

Reminds me of the typical story of someone with a complete damage protection plan and a flaky device. They take it in for repairs and it passes all the tests, but they know it's funky, so they snap it in half or otherwise completely wreck it right in front of the tech and demand that repair.

bryan_w 1540 days ago

Usually teams would consider a machine "bad" if that node in the cluster had elevated errors compared to the rest of the cluster they were running. Unfortunately this doesn't tell hardware teams what actually went wrong.
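As a rough illustration of that "elevated errors compared to the rest of the cluster" heuristic, here is a minimal sketch. The function name, the node names, and the `ratio` and `min_rate` thresholds are all made up for the example; this is not how any particular team's tooling worked.

```python
from statistics import median


def flag_outlier_nodes(error_rates, ratio=5.0, min_rate=0.01):
    """Return nodes whose error rate is far above the cluster median.

    error_rates: dict mapping node name -> fraction of requests that failed.
    ratio and min_rate are illustrative thresholds, not tuned values.
    """
    if not error_rates:
        return []
    baseline = median(error_rates.values())
    flagged = []
    for node, rate in error_rates.items():
        # Require both an absolute floor and a large gap from the median,
        # so a quiet cluster does not flag nodes over tiny differences.
        if rate >= min_rate and rate > ratio * baseline:
            flagged.append(node)
    return sorted(flagged)


# Hypothetical per-node error rates for one service.
rates = {"node-a": 0.002, "node-b": 0.003, "node-c": 0.041, "node-d": 0.002}
print(flag_outlier_nodes(rates))  # prints ['node-c']
```

Note that the result is only a list of suspect node names. It carries no information about which component failed, which is exactly the limitation described here.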

If one could show that the CPU said 2+2=9, I'm sure they would yank it out right away, but "it returns 500 errors a lot" isn't very debuggable. The only thing they can do is run the diag and return it to service if nothing comes up.
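For what a "2+2=9" style report could look like, here is a hedged sketch that runs the same deterministic computation pinned to each core and compares the answers. It assumes Linux (`os.sched_setaffinity` is not available elsewhere), and `workload` and `check_cores` are hypothetical names. A Python-level loop is only a coarse stand-in: real screening would use targeted instruction sequences, and a marginal core may well pass a test like this.

```python
import hashlib
import os
from collections import Counter


def workload(seed, rounds=2000):
    """Deterministic integer/hash workload: same seed must give same digest."""
    x = seed
    h = hashlib.sha256()
    for _ in range(rounds):
        # 64-bit linear congruential step, fed into a running hash.
        x = (x * 6364136223846793005 + 1442695040888963407) % (1 << 64)
        h.update(x.to_bytes(8, "little"))
    return h.hexdigest()


def check_cores(seed=12345, repeats=3):
    """Run the workload on each allowed core; return cores that disagree."""
    if not hasattr(os, "sched_setaffinity"):
        raise RuntimeError("CPU pinning via os.sched_setaffinity is Linux-only")
    original = os.sched_getaffinity(0)
    results = {}
    try:
        for core in sorted(original):
            os.sched_setaffinity(0, {core})  # pin this process to one core
            results[core] = {workload(seed) for _ in range(repeats)}
    finally:
        os.sched_setaffinity(0, original)  # always restore the original mask
    # Treat the most common digest as the reference answer.
    counts = Counter(d for digests in results.values() for d in digests)
    reference = counts.most_common(1)[0][0]
    return [core for core, digests in results.items() if digests != {reference}]


if __name__ == "__main__":
    print("suspect cores:", check_cores())
```

A non-empty result names a specific core and a reproducible wrong answer, which is a much stronger report than an elevated error rate.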

jeffbee 1540 days ago

Well that's one of the reasons this is difficult to handle as an organization. The novice says "the machine is broken" and is mistaken. But the expert says the same thing, and is correct. Same with compiler bugs: novices believe the compiler is full of bugs, journeymen believe the compiler is infallible, but the wise return to the knowledge that the compiler is full of bugs. Maybe that company just needs "bad machine readability" or something.

And your last statement is definitely not true. I can recall multiple instances of demonstrable logic errors in which the machine repeatedly returned to service. This includes all of the machines of a certain generation of a certain vendor's CPUs that were found to have latent ALU bugs, 8 years after going into service.
