Hacker News new | ask | show | jobs
by opisthenar84 847 days ago
Might be a noob question but for truly important data, couldn't SDCs be detected by using ECC everywhere?
3 comments

ECC isn’t free and ECC has a limited ability to detect all statistically plausible errors. Additionally, error correction in hardware is frequently defined by standards, some of which have backward compatibility requirements that go back decades. This is why, for example, reliable software often uses (quasi-)cryptographic checksums at all I/O boundaries. There is error correction in the hardware but in some parts of the silicon that error correction is weak enough that it is likely to eventually deliver a false negative in large scale systems.

None of this is free, and there are both hardware and software solutions for mitigating various categories of risk. It is explicitly modeled as an economics problem i.e. how does the cost of not mitigating a risk, if it materializes, compare to the cost of minimizing or eliminating it. In many cases, the optimal solution is unintuitive, such as computing everything twice or thrice and comparing the results rather than using error correction.

There are errors within the CPU.

As for adding ECC within the CPU, I think that would require you to essentially have a second CPU in parallel to compare against.

Actually it's not uncommon for there to be ECC used within components as a method to guard against stuff like this. I don't think it's practical to ever have complete coverage without going full blown dual/triple redundant CPU but for stuff like SSD controllers they have ECC coverage internally on the data path.
Caches, register files, and coherency traffic all definitely include error-correction.
You actually need 3 - which is how it's done for space (I believe SpaceX uses this as a solution to avoiding radiation hardened costs).

2 will tell you if they diverge, but you lose both if they do. 3 let's you retain 2 in operation if one does diverge.

If you're not hard realtime 2 is enough, you just redo the computation.
But if it's a consistent fault, like the silent data corruption covered in the linked paper, redoing the computation is still going to end up with no way to identify which core is faulty. If it's an intermittent fault, then even for hard realtime you can accomplish that with one core, just compute 3x and go with majority result.
Yup exactly. The only way independent hardware can help is if the fault is state dependent in a way on the hardware (eg differences in behavior due to thermal load or different internal state corruption or something) in which case repeated computations may not help if the repeated computation is not sufficiently decoupled temporally to get rid of that state. The other thing with independent hardware is that you don’t pay a 3x performance penalty (instead 3x cost penalty). That being said, none of these fault modes are what are really what is being discussed in the paper.

The other one that freaks me out is miscompilation by the compiler and JITs in the data path of an application. Like we’re using these machines to process hundreds of millions of transactions and trillions of dollars - how much are these silent mistakes costing us?

I think that strictly looking at it in terms of money-related operations stuff can still be managed/double-checked externally, i.e. by the real world, which means that whatever mistakes/inconsistencies might show up there's still a "hard reality" out there that will start screaming "hey! this money figure is not correct!" because people tend to notice when there are big money-discrepancies and the "mistakes" are, generally speaking, reversible when it comes to money.

What's worrying is when systems like these get used in real-time life-and-death situations, and there's basically no reversibility because that would imply dead people returning to life. For example the code used for stuff like outer space exploration, sure that right now we can add lots and lots of redundancies and check-ups in the software being used in that domain because the money is there to be spent and we still don't have that many people out there in space. But what will happen when we'll think of hosting hundreds, even thousands of people inside a big orbital station? How will we be able to make sure that all the safety-related code for that very big structure (certainly much bigger than we have now in space) doesn't cause the whole thing to go kaboom based on an unknown-unknown software error?

And leaving aside scenarios that are not there yet, right now we've started using software more and more when it comes to warfare (for example for battle simulations based on which real-life decisions are taken), what will happen to the lives of soldiers whose conduct in war has been lead by faulty software?

If it's consistent and persistent, wouldn't that classify as broken hardware requiring device change?

Even with 3 chips, if one is permanently wrong you are then left with only 2 working ones so no redundancy is left for further degradation.

> just compute 3x

That might be difficult if CPU is broken. How are you sure you actually computed 3 times if you can't trust the logic.

>wouldn't that classify as broken hardware requiring device change?

Yes but you need to catch it first to know what to take out of production.

>That might be difficult if CPU is broken. How are you sure you actually computed 3 times if you can't trust the logic.

That's kind of my point. Either it's a heisen-bug and you never see those results again when you repeat the original program or it's permanently broken and you need to swap out the sketchy CPU. If you only care about the first case then you only need one core. If you care about the second case then you need 3 if you want to come up with an accurate result instead of just determining that one of them is faulty. It's like that old adage about clocks on ships. Either take one clock or take three, never two.

In those cases, the CPU makes a false calculation independent of what's done in RAM. It can be solved by having flop redundancy as in system z - but nobody at Google or Meta would be considering big metal.

From my point of view, this technology problem may be interesting academically (and good for pretending to be important in the hierarchy at those companies) but a non-issue at scale business-wise in modern data centers.

Have a blade that once in a while acts funny? Trash and replace. Who cares what particular hiccup the CPU had.

> a non-issue at scale business-wise in modern data centers.

I've worked on similar stuff in the past at Google and you couldn't be more wrong. For example, if your CPU screwed up an AES calculation involved in wrapping an encryption key, you might end up with fairly large amounts of data that can't be decrypted anymore. Sometimes the failures are symmetric enough that the same machine might be able to decrypt the data it corrupted, which means a single machine might not be able to easily detect such problems.

We used to run extensive crypto self testing as part of the initialization of our KMS service for that reason.

Sure. It’s a cool issue to work on and maybe actually relevant at Google scale. But I’ve asked your colleagues multiple time if the business side actually cared about the issue and they never confirmed.

Again, cool to work on at Google. Not sure anybody else cares. If you care (finance) you fix it in hardware (system z).

Why would the business side ever care about technical details? It's like asking the business what days the dumpsters get emptied. Nobody gives a fuck; they just care that it gets done and gets done quickly, correctly, and safely.
A CFO knows which factors have a significant impact on the bottom line.
If a CFO knows which days the dumpster is emptied, you have a strange CFO. The metaphor is to point out that there’s a lot of technical details that aren’t tracked (like usually refactoring isn’t tracked independently) and shouldn’t be tracked because they are the normal part of the technical job. A CFO can’t measure it even if they wanted to because nobody else is measuring crazy things, like how fast you walk to the bathroom or any other metrics that are specifically related to doing your job.