by AnthonyMouse 1206 days ago
ECC is error correcting. A bit gets flipped and it not only detects it but fixes it. Two bits get flipped and it can at least detect it and panic the machine immediately instead of corrupting your data.
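To make the single-bit/double-bit distinction concrete, here is a minimal sketch of a SECDED (single error correct, double error detect) code in Python. It uses a toy extended Hamming code with 4 data bits and 4 check bits; the function names `encode`, `decode` and `flip` are made up for this illustration, and real memory controllers use much wider codes implemented in hardware.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    word = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; check bits at 1, 2, 4; overall parity at 0.
    word[3], word[5], word[6], word[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        for i in range(1, 8):
            if i & p and i != p:
                word[p] ^= word[i]
    word[0] = sum(word[1:]) % 2  # makes the total number of 1 bits even
    return word


def decode(word):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    word = list(word)
    syndrome = 0
    for i in range(1, 8):
        if word[i]:
            syndrome ^= i  # XOR of the positions of all set bits
    odd_parity = sum(word) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: assume exactly one and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips but a nonzero syndrome: detected, not fixable.
        return "uncorrectable", None
    nibble = word[3] | word[5] << 1 | word[6] << 2 | word[7] << 3
    return status, nibble


def flip(word, *positions):
    """Return a copy of word with the given bit positions inverted."""
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out


word = encode(0b1011)
print(decode(word))              # ('ok', 11)
print(decode(flip(word, 5)))     # ('corrected', 11)
print(decode(flip(word, 5, 6)))  # ('uncorrectable', None)
```

The "uncorrectable" branch is the point where a real machine would raise a machine check instead of handing the bad word to software.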

Without it the corruption is silent. Then this kind of thing happens:

https://news.ycombinator.com/item?id=35026440

Which is another reason not to solder the storage either.

Suppose you have a system board with bad soldered memory and you want to copy your data off of it onto a replacement. The memory is flipping random bits as it copies, and because the flash chips are permanently attached to the same board as the bad memory, you can't move the storage to a good machine and read it there.

Otherwise it would have been just a support ticket; now it's something worse.

1 comments

>ECC is error correcting. A bit gets flipped and it not only detects it but fixes it. Two bits get flipped and it can at least detect it and panic the machine immediately instead of corrupting your data.

I did neglect to mention that ECC by definition can correct errors, but I wonder if what's making people upset with my comment is the implication that ECC can't detect all errors.

But it's true: ECC can't detect all bitflips, and in fact there's at least one study[1] that suggests quite a lot of memory errors go entirely undetected even with ECC.
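A rough illustration of why some flips slip through: in a distance-4 code, three flipped bits can leave the word one bit away from a different valid codeword, so the decoder "corrects" it to the wrong value and reports success. The sketch below brute-forces this for a toy 8-bit extended Hamming code. The code size and the helper names are assumptions for illustration only; real DIMM ECC uses wider codes, where the share of miscorrected triple flips will differ.

```python
from itertools import combinations


def is_codeword(w):
    """True if the 8-bit integer w is a valid extended Hamming codeword."""
    syndrome = 0
    for i in range(1, 8):
        if (w >> i) & 1:
            syndrome ^= i
    return syndrome == 0 and bin(w).count("1") % 2 == 0


CODEWORDS = [w for w in range(256) if is_codeword(w)]


def distance(a, b):
    """Hamming distance: number of bit positions where a and b differ."""
    return bin(a ^ b).count("1")


def nearest(w):
    """The valid codeword closest to w, which is what error correction aims for."""
    return min(CODEWORDS, key=lambda c: distance(c, w))


def triple_flip_outcomes():
    """Count 3-bit corruptions that look like a 1-bit error in a different codeword."""
    silent = total = 0
    for original in CODEWORDS:
        for bits in combinations(range(8), 3):
            corrupted = original
            for b in bits:
                corrupted ^= 1 << b
            guess = nearest(corrupted)
            total += 1
            if guess != original and distance(guess, corrupted) == 1:
                silent += 1
    return silent, total


silent, total = triple_flip_outcomes()
print(f"{silent} of {total} triple flips decode to the wrong codeword")
```

In this toy code every triple flip ends up miscorrected, because each 3-bit error pattern sits exactly one bit away from some other codeword.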

Silent corruption does in fact occur even with ECC and it may not even be particularly rare, even though it is rarer than typical single/double-bit flips. Of course, the majority of desktops use non-ECC RAM and it's mostly fine, so I assume this is only ever going to matter in production workloads, and exactly what impact it has is hard to gauge.

[1]: https://pages.cs.wisc.edu/~remzi/Classes/739/Fall2018/Papers...

Maybe the issue is that undetectable errors are possible, but if the system is in such a bad way that they're happening at any appreciable rate, you'll also be getting quite a lot of the detectable ones, and so you get prompt notice that something is wrong.
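A back-of-the-envelope version of that argument, as a sketch: assume each bit of a 72-bit ECC word (64 data plus 8 check bits) flips independently with probability `p`. Both the independence and the specific values of `p` are assumptions made up for this example; real failures such as a dying chip are correlated and can look quite different.

```python
from math import comb


def flip_probabilities(p, n=72):
    """Probability of 1-2 flips vs. 3 or more flips in an n-bit word.

    Assumes independent flips with per-bit probability p (a simplification).
    """
    def exactly(k):
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    handled = exactly(1) + exactly(2)                     # corrected or detected
    beyond = sum(exactly(k) for k in range(3, n + 1))     # may escape SECDED
    return handled, beyond


for p in (1e-9, 1e-6, 1e-3):
    handled, beyond = flip_probabilities(p)
    print(f"p={p:g}: about {handled / beyond:.3g} handled errors per 3+ bit error")
```

Under these assumptions the ratio shrinks as the per-bit error rate rises, but one- and two-bit errors still far outnumber wider ones, which is why a failing module would be expected to trip the ECC reporting long before silent corruption becomes common.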

Whereas without ECC you could have silent data corruption for years and only discover it after it gets severe enough to warrant a manual investigation, after the damage has already propagated to your backups.

> Of course, the majority of desktops use non-ECC RAM and it's mostly fine, so I assume this is only ever going to matter in production workloads, and exactly what impact it has is hard to gauge.

There are two reasons it's useful. One is the random cosmic-ray bit flip that happens even to hardware in good condition, which ECC can usually detect and correct. That's less common, and it matters more for production workloads.

The other is, your hardware is experiencing a higher than average number of random bit flips, and then ECC gives you immediate notice when this starts happening instead of letting it sow chaos until something crashes so hard you take notice.