Silent data corruptions at scale (2021) | HN Mirror

	Silent data corruptions at scale (2021) (arxiv.org)
	84 points by losfair 892 days ago

7 comments

userbinator 890 days ago

Very interesting topic, but rather low on detail --- really wanted to see what those 60 lines of Asm that allegedly show a faulty CPU instruction were, and also surprised that it wasn't intermittent; in my experience, CPU problems usually are intermittent and heavily dependent upon prior state, and manually stepping through with a debugger has never shown the "1+1=3" type of situation they claim. That said, I wonder if LINPACK'ing would've found it, as that is known to be a very powerful stress-test with divisive opinions among the overclocking community; some, including me, claim that a system can never be considered stable if it fails LINPACK since that is essentially showing intermittent "1+1=3" behaviour, while others are fine with "occasional" discrepancies in its output since the system otherwise appears to be stable.

jorticka 890 days ago

Like all stress tests, linpack will find some errors, but not all.

I had memory stability issues which would immediately show under Prime95 (less than 1 minute) but pass hours of Linpack.

sirlancer 890 days ago

Prime95 is my gold standard for CPU and memory testing. Everything from desktops to HPC and clustered filesystems gets a 24 hour “blend” of tests. If that passes without any instability or bit flips then we’re ready for production.

c0l0 890 days ago

In my experience, LINPACK (at least the Intel MKL on GenuineIntel combination) is both quicker and more thorough in finding setups that are not actually stable/reliable.

thfuran 890 days ago

>while others are fine with "occasional" discrepancies

I guess I'd probably be okay with that if the only thing I ever used the computer for was gaming.

dang 890 days ago

Related:

Meta quickly detects silent data corruptions at scale - https://news.ycombinator.com/item?id=30905636 - April 2022 (95 comments)

Silent Data Corruptions at Scale - https://news.ycombinator.com/item?id=27484866 - June 2021 (12 comments)

dataflow 890 days ago

Google also had a "Cores That Don't Count" paper on so-called "mercurial cores" https://news.ycombinator.com/item?id=27378624 as well as a presentation https://www.youtube.com/watch?v=QMF3rqhjYuM

ekelsen 890 days ago

I wrote an article about these affecting LLM training at https://www.adept.ai/blog/sherlock-sdc

walterbell 890 days ago

Thanks, does your blog have a working RSS feed?

opisthenar84 890 days ago

Might be a noob question but for truly important data, couldn't SDCs be detected by using ECC everywhere?

jandrewrogers 890 days ago

ECC isn’t free and ECC has a limited ability to detect all statistically plausible errors. Additionally, error correction in hardware is frequently defined by standards, some of which have backward compatibility requirements that go back decades. This is why, for example, reliable software often uses (quasi-)cryptographic checksums at all I/O boundaries. There is error correction in the hardware but in some parts of the silicon that error correction is weak enough that it is likely to eventually deliver a false negative in large scale systems.

None of this is free, and there are both hardware and software solutions for mitigating various categories of risk. It is explicitly modeled as an economics problem i.e. how does the cost of not mitigating a risk, if it materializes, compare to the cost of minimizing or eliminating it. In many cases, the optimal solution is unintuitive, such as computing everything twice or thrice and comparing the results rather than using error correction.
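
A minimal sketch of the "checksum at every I/O boundary" idea mentioned above, assuming SHA-256 from Python's standard library as the checksum. The names `seal` and `unseal` are hypothetical, chosen only for illustration:

```python
import hashlib

# Length in bytes of a SHA-256 digest (32).
DIGEST_LEN = hashlib.sha256().digest_size

def seal(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest before it leaves the process."""
    return hashlib.sha256(payload).digest() + payload

def unseal(blob: bytes) -> bytes:
    """Recompute the digest on the way back in and raise if the bytes changed."""
    digest, payload = blob[:DIGEST_LEN], blob[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: payload changed after it was sealed")
    return payload
```

Note that this only catches corruption that happens after `seal` runs. If a faulty core corrupts the payload before the digest is computed, the checksum will faithfully protect the wrong bytes.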

teaearlgraycold 890 days ago

There are errors within the CPU.

As for adding ECC within the CPU, I think that would require you to essentially have a second CPU in parallel to compare against.

MertsA 890 days ago

Actually it's not uncommon for there to be ECC used within components as a method to guard against stuff like this. I don't think it's practical to ever have complete coverage without going full blown dual/triple redundant CPU but for stuff like SSD controllers they have ECC coverage internally on the data path.

moonchild 890 days ago

Caches, register files, and coherency traffic all definitely include error-correction.

XorNot 890 days ago

You actually need 3 - which is how it's done for space (I believe SpaceX uses this as a solution to avoiding radiation hardened costs).

2 will tell you if they diverge, but you lose both if they do. 3 lets you retain 2 in operation if one does diverge.

jorticka 890 days ago

If you're not hard realtime 2 is enough, you just redo the computation.
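
A rough sketch of the "two copies, redo on mismatch" approach, assuming the computation is deterministic and its result supports `==`. The helper `run_checked` is hypothetical, and for simplicity both runs happen in the same process here, whereas a real deployment would place them on separate cores or machines:

```python
def run_checked(fn, *args, max_attempts=3):
    """Run fn twice per attempt and accept the result only when both runs agree."""
    for _ in range(max_attempts):
        first = fn(*args)
        second = fn(*args)
        if first == second:
            return first
        # The two runs diverged: discard both results and redo the computation.
    raise RuntimeError("results kept diverging; treat the hardware as suspect")
```

For example, `run_checked(pow, 2, 10)` returns `1024` on healthy hardware. The price is latency: every divergence costs another pair of runs, which is why this only works when there is no hard realtime deadline.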

MertsA 890 days ago

But if it's a consistent fault, like the silent data corruption covered in the linked paper, redoing the computation is still going to end up with no way to identify which core is faulty. If it's an intermittent fault, then even for hard realtime you can accomplish that with one core, just compute 3x and go with majority result.
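
The "compute 3x and go with the majority result" idea could look roughly like this. It is only a sketch: `majority_vote` is a hypothetical helper, and it assumes results are hashable and that the fault is intermittent:

```python
from collections import Counter

def majority_vote(fn, *args, runs=3):
    """Run fn several times and return the value that a strict majority of runs produced."""
    results = [fn(*args) for _ in range(runs)]
    value, count = Counter(results).most_common(1)[0]
    if count <= runs // 2:
        raise RuntimeError("no majority among the runs")
    return value
```

As noted above, this does nothing against a consistent fault: if the same core produces the same wrong answer every time, all three runs agree and the vote passes.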

vlovich123 890 days ago

Yup exactly. The only way independent hardware can help is if the fault depends on hardware state (e.g. differences in behavior due to thermal load, or different internal state corruption), in which case repeated computations may not help unless they are sufficiently decoupled in time to get rid of that state. The other thing with independent hardware is that you don’t pay a 3x performance penalty (you pay a 3x cost penalty instead). That being said, none of these fault modes are really what is being discussed in the paper.

The other one that freaks me out is miscompilation by the compiler and JITs in the data path of an application. Like we’re using these machines to process hundreds of millions of transactions and trillions of dollars - how much are these silent mistakes costing us?

jorticka 890 days ago

If it's consistent and persistent, wouldn't that classify as broken hardware requiring device change?

Even with 3 chips, if one is permanently wrong you are then left with only 2 working ones so no redundancy is left for further degradation.

> just compute 3x

That might be difficult if the CPU is broken. How are you sure you actually computed 3 times if you can't trust the logic?

lobochrome 890 days ago

In those cases, the CPU makes a false calculation independent of what's done in RAM. It can be solved by having flop redundancy as in system z - but nobody at Google or Meta would be considering big metal.

From my point of view, this technology problem may be interesting academically (and good for pretending to be important in the hierarchy at those companies) but a non-issue at scale business-wise in modern data centers.

Have a blade that once in a while acts funny? Trash and replace. Who cares what particular hiccup the CPU had.

delroth 890 days ago

> a non-issue at scale business-wise in modern data centers.

I've worked on similar stuff in the past at Google and you couldn't be more wrong. For example, if your CPU screwed up an AES calculation involved in wrapping an encryption key, you might end up with fairly large amounts of data that can't be decrypted anymore. Sometimes the failures are symmetric enough that the same machine might be able to decrypt the data it corrupted, which means a single machine might not be able to easily detect such problems.

We used to run extensive crypto self testing as part of the initialization of our KMS service for that reason.
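
For readers unfamiliar with this kind of self-test, here is a minimal known-answer test sketch. It is not the KMS code described above: it uses SHA-256 from Python's standard library as a stand-in for AES, and `crypto_self_test` is a hypothetical name. The two digests are the standard published SHA-256 values for the empty message and for `abc`:

```python
import hashlib

# Standard SHA-256 test vectors: message -> expected hex digest.
KNOWN_ANSWERS = {
    b"": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    b"abc": "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad",
}

def crypto_self_test() -> bool:
    """Return True only if this machine reproduces every known answer."""
    return all(
        hashlib.sha256(message).hexdigest() == expected
        for message, expected in KNOWN_ANSWERS.items()
    )
```

A service would call something like this during initialization and refuse to start if it returns `False`. Comparing against fixed, externally published answers matters because a round trip on the same machine (encrypt, then decrypt) can pass when the fault is symmetric.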

lobochrome 890 days ago

Sure. It’s a cool issue to work on and maybe actually relevant at Google scale. But I’ve asked your colleagues multiple times if the business side actually cared about the issue and they never confirmed.

Again, cool to work on at Google. Not sure anybody else cares. If you care (finance) you fix it in hardware (system z).

withinboredom 890 days ago

Why would the business side ever care about technical details? It's like asking the business what days the dumpsters get emptied. Nobody gives a fuck; they just care that it gets done and gets done quickly, correctly, and safely.

lobochrome 888 days ago

A CFO knows which factors have a significant impact on the bottom line.

twhitmore 890 days ago

Interesting. The corruption was in a math.pow() calculation, representing a compressed filesize prior to a file decompression step.

Compressing data, with the increased information density & greater number of CPU instructions involved, seems obviously to increase the exposure to corruption/bitflips.

What I did wonder was why compress the filesize as an exponent? One would imagine that representing as a floating-point exponent would take lots of cycles, pretty much as many bits, and have nasty precision inaccuracies at larger sizes.
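
The precision concern is easy to demonstrate in isolation. This sketch says nothing about how the code in the paper actually stores sizes; it only shows that a 64-bit float has a 53-bit significand, so not every integer above 2**53 survives a trip through floating point. The helper `is_exact_in_float64` is hypothetical:

```python
def is_exact_in_float64(n: int) -> bool:
    """Return True if the integer n survives a round trip through a 64-bit float."""
    return int(float(n)) == n

# 2**53 is representable exactly, but the next integer up is not.
small_enough = is_exact_in_float64(2**53)
too_precise = is_exact_in_float64(2**53 + 1)
```

Here `small_enough` is `True` and `too_precise` is `False`, so any size that passes through a floating-point representation can lose its low-order bits once it gets large enough.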

SomeoneFromCA 890 days ago

Interesting paper, but it has some technical errors. First of all, they keep mentioning SRAM+ECC instead of DRAM+ECC. Second, you cannot use gcj to inspect the assembly code generated for a Java method, as it will be completely different from the code generated by Hotspot. Third, you do not need all those acrobatics to get a disassembly of the method: you could just add an infinite loop to the code, attach gdb to the JVM process, and inspect the code or dump the core.

MertsA 890 days ago

Disclaimer: I work at Meta and I know a couple of the authors of the paper but my work is completely unrelated to the subject of the paper.

That's not a technical error, they mean SRAM in the CPU itself. You're right about gcj but that's kind of a moot point when investigating some reproducible CPU bug like this. The paper mentions all the acrobatics they went through when trying to find the root cause but if gcj would have been practical then it also would have been immediately clear if the gcj output reproduced the error or not. If it didn't reproduce, no big deal, try another approach. You might be right about it being easier to root cause with gdb directly but I'm not so sure. Starting out, you have no idea which instructions under what state are triggering the issue so you'd be looking for a needle in a haystack. A crashdump or gdb doesn't let you bisect that haystack so good luck finding your needle.

SomeoneFromCA 890 days ago

GCJ's implementation could be so vastly different from Hotspot's that you might as well rewrite it in C and check whether it fails or not. ChatGPT would generate a test case within a minute.

It all depends how good you are with x64 assembly. If you are good enough, you can easily deduce what the instructions at the location do, and can potentially simply copy-paste into an asm file, compile it and check result. Would be much faster to me.

Bluntly speaking, people who are not familiar with low-level debugging made an honest and successful attempt to investigate a low-level issue. A seasoned kernel developer or reverse engineer would have just used gdb straight away.

MertsA 890 days ago

>A seasoned kernel developer

I think you should take another look at the author list. Chris Mason counts as a seasoned kernel developer in my book. Either way I think you're missing the point. Yes gcj would be different, but there's a decent chance it could hand you a binary that reproduces the issue that you can bisect to the root cause from there. It's one thing to run it through gcj and see if it reproduces, rewriting it in C is a ton of work compared to gcj for something that might not pan out.

SomeoneFromCA 890 days ago

I am not missing the point, as I do not believe in authorities or in someone else's evaluation of yet another person's skill level. Rewriting a simple exponentiation in C would not be "a ton of work", and pinpointing the culprit, the exponentiation, does not require any gdb debugging or disassembling. In fact, just knowing that the exponentiation caused it suggests faulty hardware, and no further investigation is required.

You should probably invite these people themselves to the discussion instead of speaking on their behalf. Not productive.