by viraptor
1317 days ago
I wonder how well that paper holds up over a decade later. It reviewed DDR1/2 in 2009. I like to ask people running ECC to check their error counters (on Linux, `edac-util -rfull`). From my very non-scientific survey, memory errors seem to happen significantly less often than this paper would lead you to believe. Then again, running ECC in the first place usually means better hardware than a non-ECC machine, so that's a likely bias.
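If `edac-util` isn't installed, the same counters can usually be read straight from sysfs. A minimal sketch, assuming the kernel's EDAC driver is loaded and exposes per-controller `ce_count` / `ue_count` files under `/sys/devices/system/edac/mc/`; the function name `read_edac_counts` is made up for illustration:

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}} for each mc* dir.

    "ce" is the correctable-error count and "ue" the uncorrectable-error
    count. A value of None means the file was missing or unreadable.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        # Either the EDAC driver is not loaded or nothing is being reported.
        print("no EDAC memory controllers found")
    for name, c in counts.items():
        print(f"{name}: correctable={c['ce']} uncorrectable={c['ue']}")
```

An empty result or all-zero counters only tells you nothing was reported to the OS, not that no errors occurred.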
Bryan Cantrill mentions in one of his talks that Joyent had a datacenter where uncorrectable errors were sporadically halting servers, yet no correctable errors were ever counted. He eventually got the motherboard firmware vendor to admit that these were handled "firmware-first", meaning they were intentionally not reported to the OS.
I've looked into using some consumer AMD CPUs that theoretically work with ECC memory, and a couple of motherboards from ASUS and ASRock that theoretically support ECC, but I've heard it's hard to tell whether it's really working.
Testing whether motherboard firmware actually reports ECC errors probably doesn't really happen, because everything seems to work fine if it doesn't report them, and the company just wants to finish QA and ship. The rare motherboard that does report errors correctly is more likely to trigger bugs in higher layers that were never actually tested before. And there's pressure to disable or hide the feature to reduce pesky customer support costs: "No one else reports any errors, why does your product report errors, I want a replacement," etc.
Consumer DDR5 all has on-die ECC, out of desperate necessity, but it doesn't report anything, so you can't tell how close to the sun it's flying. Rowhammer just keeps coming back.