> So Google found fail-silent Corruption Execution Errors (CEEs) at CPU/cores. This is interesting because we thought tested CPUs do not have logic errors, and if they had an error it would be a fail-stop or at least fail-noisy hardware errors triggering machine checks. Previously we had known about fail-silent storage and network errors due to bit flips, but the CEEs are new because they are computation errors. While it is easy to detect data corruption due to bit flips, it is hard to detect CEEs because they are rare and require expensive methods to detect/correct in real-time.
> The paper claims that silent data corruptions can occur due to device characteristics and are repeatable at scale. They observed that these failures are reproducible and not transient. Then, how come did these CPUs pass the quality control tests by the chip producers? In soft-error based fault injection studies by chip producers, CPU CEEs are evaluated to be a one in a million occurrence, not 1 in 1000 observed at deployment at Facebook and Google... The paper also says that increased density, technology scaling, and wider datapaths increase the probability of silent errors.
https://muratbuffalo.blogspot.com/2021/06/silent-data-corrup...