by DCKing 2077 days ago
I guess what I don't understand then is what big advantage "enterprise ECC" has left over this DDR5 "non-enterprise ECC". (Seriously, why is ECC "enterprise"? Everybody wins with memory error correction.) If regular DDR5 can correct single bitflips, it is on par in correction capabilities with "enterprise ECC" DDR4.

Maybe this won't allow for the detection of multiple flips, and maybe won't even report single bit flips to the OS (it'll just fix them silently). I suppose there's no big need to support detection and reporting for the vast majority of use cases. Ryan Smith at Anandtech in the link above says as much: "Between the number of bits per chip getting quite high, and newer nodes getting successively harder to develop, the odds of a single-bit error is getting uncomfortably high. So on-die ECC is meant to counter that, by transparently dealing with single-bit errors."

But for my purposes, if just the correction capabilities are on par with DDR4 ECC I'd be absolutely fine with that. And I guess that goes for many people. Even while using ECC memory now at home, I'm not monitoring the correction statistics and I'm guessing few people do in general. It might as well be silent today if you ask me.
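
For anyone who does want to look at the correction statistics, here is a minimal sketch. It assumes a Linux machine with an EDAC driver loaded, which exposes per-memory-controller `ce_count` (corrected) and `ue_count` (uncorrected) files in sysfs. The function name `read_edac_counts` is made up for illustration. These counters only show what the memory controller reports, so corrections done silently inside a DRAM chip would presumably never appear here.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("no EDAC memory controllers found")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

An empty result just means no EDAC controller was found (no ECC memory, no driver, or not Linux), not that the memory is error-free.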

4 comments

> I guess what I don't understand then is what big advantage "enterprise ECC" has left over this DDR5 "non-enterprise ECC"

ECC should be end-to-end, so it detects and (hopefully) corrects errors anywhere along the path, not just within a chip.

Step 1 of handling a lot of ECC correction events is to reseat the DIMM, because often it's just an issue with the connection, not actually a memory defect.

And you may not care too much about reports of correction events, but you definitely want to see correction failures reported - the point is, after all, to avoid corruption.

Most servers have chipkill ECC that can survive an entire 4-bit chip going bad, so it's more powerful than classic SECDED. I don't know how often chipkill kicks in, though.
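
To make "classic SECDED" concrete, here is a toy sketch using an extended Hamming (8,4) code: 4 data bits, 3 Hamming parity bits, plus 1 overall parity bit. Real ECC DIMMs use a wider code (64 data bits plus 8 check bits), and the names `encode` and `decode` are just for illustration, but the single-error-correct, double-error-detect logic has the same shape.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity; indices 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status): status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits is 0 for a valid codeword,
    # and equals the position of the flipped bit after a single flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    odd_flips = sum(code) % 2  # 1 if an odd number of bits flipped
    if syndrome == 0 and not odd_flips:
        status = "ok"
    elif odd_flips:
        # Treat as a single flip; syndrome 0 means the overall parity bit itself.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1          # one bit flip: fixed
    print(decode(word))
    word[2] ^= 1          # second bit flip: detected, not fixed
    print(decode(word))
```

Three or more flips can be miscorrected by a code like this, which is part of why stronger schemes such as chipkill exist.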

I don't know what "enterprise ECC" means, but there are certainly grades of protection, from single-bit error detection (parity) through triple-error-correct, quadruple-error-detect, and inline vs. non-inline correction schemes. (For the latter, the machine has to stop, go back, fix the error, and resume, potentially at significant performance cost.)

On SSDs, we never get a notification for each corrected error, probably because there would be too many of them every day. I expect DRAM may end up working the same way.