by lobochrome
849 days ago
In those cases, the CPU makes a false calculation independent of what's done in RAM. It can be solved by having flop redundancy as in System z, but nobody at Google or Meta would be considering big metal. From my point of view, this technology problem may be interesting academically (and good for pretending to be important in the hierarchy at those companies) but is a non-issue at scale business-wise in modern data centers. Have a blade that once in a while acts funny? Trash and replace. Who cares what particular hiccup the CPU had.
I've worked on similar stuff in the past at Google and you couldn't be more wrong. For example, if your CPU screwed up an AES calculation involved in wrapping an encryption key, you might end up with fairly large amounts of data that can't be decrypted anymore. Sometimes the failures are symmetric enough that the same machine might be able to decrypt the data it corrupted, which means a single machine might not be able to easily detect such problems.
We used to run extensive crypto self-testing as part of the initialization of our KMS service for that reason.
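To make the "symmetric failure" point concrete, here is a minimal sketch of why a round-trip check on the same machine isn't enough and a known-answer test is needed. Everything in it is hypothetical: `xor_wrap` is a toy XOR stand-in for key wrapping (not AES and not real crypto), `faulty_xor_wrap` simulates a defect that consistently flips one bit, and `self_test` is just an illustration of a startup check, not what our KMS actually ran.

```python
KEK = bytes(range(16))        # toy key-encryption key: 00 01 ... 0f
KEY = bytes(range(16, 32))    # toy key to wrap:        10 11 ... 1f
# Known answer, assumed to have been computed on a known-good machine:
# i ^ (16 + i) == 0x10 for every i in 0..15.
EXPECTED = bytes([0x10] * 16)


def xor_wrap(kek: bytes, key: bytes) -> bytes:
    """Toy stand-in for key wrapping (NOT real crypto): XOR key with KEK."""
    return bytes(a ^ b for a, b in zip(kek, key))


def faulty_xor_wrap(kek: bytes, key: bytes) -> bytes:
    """Simulated CPU defect: one KEK bit is flipped on every call."""
    bad_kek = bytes([kek[0] ^ 0x01]) + kek[1:]
    return xor_wrap(bad_kek, key)


def round_trip_ok(wrap, kek: bytes, key: bytes) -> bool:
    """Wrap then unwrap on the same machine (XOR is its own inverse)."""
    return wrap(kek, wrap(kek, key)) == key


def known_answer_ok(wrap, kek: bytes, key: bytes, expected: bytes) -> bool:
    """Compare the wrapped output against a value fixed ahead of time."""
    return wrap(kek, key) == expected


def self_test(wrap) -> bool:
    """Startup check: require both the round trip and the known answer."""
    return round_trip_ok(wrap, KEK, KEY) and known_answer_ok(
        wrap, KEK, KEY, EXPECTED
    )


if __name__ == "__main__":
    for name, wrap in [("healthy", xor_wrap), ("faulty", faulty_xor_wrap)]:
        print(name,
              "round-trip:", round_trip_ok(wrap, KEK, KEY),
              "known-answer:", known_answer_ok(wrap, KEK, KEY, EXPECTED))
```

The faulty machine passes its own round trip because the same error is applied on the way in and on the way out, but any healthy machine would fail to unwrap what it produced. Only the comparison against a value computed elsewhere catches it. A real service would use published test vectors for the actual cipher (for AES key wrap, the ones in RFC 3394) instead of the toy values here.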