by jchw
1208 days ago
> ECC is error correcting. A bit gets flipped and it not only detects it but fixes it. Two bits get flipped and it can at least detect it and panic the machine immediately instead of corrupting your data.

I did neglect to mention that ECC by definition can correct errors, but I wonder if what's making people upset with my comment is the implication that ECC can't detect all errors. But it's true: ECC can't detect all bit flips, and in fact there's at least one study[1] that suggests quite a lot of memory errors go entirely undetected even with ECC. Silent corruption does occur even with ECC, and it may not even be particularly rare, even though it is rarer than typical single- or double-bit flips.

Of course, the majority of desktops use non-ECC RAM and it's mostly fine, so I assume this is only ever going to matter in production workloads, and exactly what impact it has is hard to gauge.

[1]: https://pages.cs.wisc.edu/~remzi/Classes/739/Fall2018/Papers...
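To make the "corrects one, detects two, can miss more" behavior concrete, here is a toy sketch using an extended Hamming(8,4) code. This is an illustration only: real ECC memory uses much wider codewords, and the function names `encode` and `decode` are made up for this example. The principle it shows is the one discussed above: a single flip is corrected, a double flip is flagged, and three flips can look like a correctable single-bit error and get "fixed" into the wrong value.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # code[1..7] is Hamming(7,4); code[0] is the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos       # XOR of set positions is 0 for a valid codeword
    odd_parity = sum(code) % 2    # 1 means an odd number of bits flipped

    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Treated as a single-bit error. Syndrome 0 here means the overall
        # parity bit (index 0) itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: an even number of flips.
        return "uncorrectable", None

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


word = encode(0b1011)

one_flip = list(word)
one_flip[5] ^= 1
print(decode(one_flip))    # single flip: corrected back to the original value

two_flips = list(one_flip)
two_flips[6] ^= 1
print(decode(two_flips))   # double flip: detected, not correctable

three_flips = list(word)
for pos in (1, 2, 4):
    three_flips[pos] ^= 1
print(decode(three_flips)) # triple flip: reported as "corrected", but the data is wrong
```

The last case is the silent corruption scenario: the decoder has no way to tell three flips apart from one, so it reports success while returning a different value.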
|
Whereas without ECC you could have silent data corruption for years and only discover it after it gets severe enough to warrant a manual investigation, after the damage has already propagated to your backups.
> Of course, the majority of desktops use non-ECC RAM and it's mostly fine, so I assume this is only ever going to matter in production workloads, and exactly what impact it has is hard to gauge.
There are two reasons it's useful. One is the random cosmic-ray bit flip that happens even to hardware in good condition. ECC can usually detect and correct it, but that case is less common and matters more for production workloads.
The other is when your hardware is experiencing a higher than average number of random bit flips: ECC gives you immediate notice when this starts happening, instead of letting it sow chaos until something crashes so hard you take notice.
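As a rough sketch of what "immediate notice" can look like in practice: on Linux, the EDAC subsystem exposes per-memory-controller counters of corrected and uncorrected errors in sysfs. The snippet below assumes a Linux machine with an EDAC driver loaded and the usual `/sys/devices/system/edac/mc` layout; the helper names `read_edac_counts` and `new_errors` are hypothetical, and on a system without EDAC it simply returns an empty result.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} read from EDAC sysfs.

    Assumes the Linux EDAC layout (mc0, mc1, ... each with ce_count and
    ue_count files). Controllers without readable counters are skipped.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue
        counts[mc.name] = (ce, ue)
    return counts


def new_errors(previous, current):
    """Return the per-controller increase in (corrected, uncorrected) counts."""
    changed = {}
    for name, (ce, ue) in current.items():
        prev_ce, prev_ue = previous.get(name, (0, 0))
        if (ce, ue) != (prev_ce, prev_ue):
            changed[name] = (ce - prev_ce, ue - prev_ue)
    return changed


if __name__ == "__main__":
    print(read_edac_counts())
```

Sampling `read_edac_counts()` periodically and passing consecutive samples to `new_errors` gives a simple alert condition: a steadily climbing corrected-error count is the early warning of a failing DIMM that non-ECC memory never provides.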