by bryan_w
1540 days ago
Usually teams would consider a machine "bad" if that node had elevated errors compared to the rest of the cluster they were running. Unfortunately, this doesn't tell hardware teams what actually went wrong. If one could show that the CPU said 2+2=9, I'm sure they would yank it out right away, but "it returns 500 errors a lot" isn't very debuggable. The only thing they can do is run the diag and return it to service if nothing comes up.

And your last statement is definitely not true. I can recall multiple instances of demonstrable logic errors in which the machine was repeatedly returned to service. This includes all of the machines of a certain generation of a certain vendor's CPUs that were found to have latent ALU bugs, 8 years after going into service.