by lobochrome 849 days ago
In those cases, the CPU makes a false calculation independent of what's done in RAM. It can be solved by having flop redundancy as in IBM System z, but nobody at Google or Meta would be considering big iron.

From my point of view, this technology problem may be interesting academically (and good for pretending to be important in the hierarchy at those companies) but a non-issue at scale business-wise in modern data centers.

Have a blade that once in a while acts funny? Trash and replace. Who cares what particular hiccup the CPU had.


> a non-issue at scale business-wise in modern data centers.

I've worked on similar stuff in the past at Google and you couldn't be more wrong. For example, if your CPU screwed up an AES calculation involved in wrapping an encryption key, you might end up with fairly large amounts of data that can't be decrypted anymore. Sometimes the failures are symmetric enough that the same machine might be able to decrypt the data it corrupted, which means a single machine might not be able to easily detect such problems.

We used to run extensive crypto self testing as part of the initialization of our KMS service for that reason.
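To make the idea concrete, here is a minimal sketch of a startup known-answer self-test. It is not the actual KMS code. It uses SHA-256 from Python's standard library as a stand-in for AES (the standard library has no AES), and the names `failed_vectors`, `self_test_or_raise`, and `flaky_digest` are made up for illustration. The two test vectors are the published SHA-256 digests of the empty string and of "abc".

```python
import hashlib

# Published SHA-256 test vectors: (input, expected hex digest).
KNOWN_ANSWERS = [
    (b"", "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"),
    (b"abc", "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"),
]


def sha256_hex(data: bytes) -> str:
    """Digest function under test (stand-in for the real crypto primitive)."""
    return hashlib.sha256(data).hexdigest()


def failed_vectors(digest_fn=sha256_hex, rounds=100):
    """Return the inputs whose digest differed from the known answer at least once.

    Each vector is repeated `rounds` times because a marginal CPU may only
    miscompute occasionally. The expected values are constants, so a fault
    that is self-consistent on this machine (it can "verify" its own bad
    output) still shows up as a mismatch.
    """
    failed = []
    for data, expected in KNOWN_ANSWERS:
        if any(digest_fn(data) != expected for _ in range(rounds)):
            failed.append(data)
    return failed


def self_test_or_raise(digest_fn=sha256_hex):
    """Refuse to start the service if any known-answer test fails."""
    failed = failed_vectors(digest_fn)
    if failed:
        raise RuntimeError(f"crypto self-test failed for inputs: {failed!r}")


def flaky_digest(data: bytes) -> str:
    """Hypothetical faulty CPU: flips the low bit of the first hex digit."""
    good = sha256_hex(data)
    return format(int(good[0], 16) ^ 1, "x") + good[1:]


if __name__ == "__main__":
    self_test_or_raise()  # passes silently on a healthy machine
    print(failed_vectors(flaky_digest))
```

Comparing against fixed, externally published answers rather than doing a local encrypt-then-decrypt round trip is the point: a round trip on the same machine can succeed even when the output is wrong, which is the symmetric failure case described above.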

Sure. It’s a cool issue to work on and maybe actually relevant at Google scale. But I’ve asked your colleagues multiple times if the business side actually cared about the issue and they never confirmed.

Again, cool to work on at Google. Not sure anybody else cares. If you care (finance) you fix it in hardware (System z).

Why would the business side ever care about technical details? It's like asking the business what days the dumpsters get emptied. Nobody gives a fuck; they just care that it gets done and gets done quickly, correctly, and safely.

A CFO knows which factors have a significant impact on the bottom line.

If a CFO knows which days the dumpster is emptied, you have a strange CFO. The metaphor is to point out that there are a lot of technical details that aren’t tracked (refactoring, for example, usually isn’t tracked independently) and shouldn’t be tracked, because they are a normal part of the technical job. A CFO can’t measure them even if they wanted to, because nobody else is measuring crazy things like how fast you walk to the bathroom, or any other metric tied to the routine mechanics of doing your job.