It can be very hard to get memory error reporting these days. Bryan Cantrill mentions in one of his talks that Joyent had a datacenter where uncorrectable errors were sporadically halting servers, but no correctable errors were ever counted. He eventually got the motherboard firmware vendor to admit that these were handled "firmware-first", meaning intentionally not reported. I've looked into using some consumer AMD CPUs that theoretically work with ECC memory, and a couple of motherboards from ASUS and ASRock theoretically support ECC, but I've heard that it's hard to figure out if it's really working. Testing whether motherboard firmware actually reports ECC errors ... probably doesn't really happen, because everything seems to work fine if it doesn't report them, and the company just wants to finish QA and ship. And the rare motherboard that does report errors correctly is more likely to trigger bugs in higher layers that were never actually tested before. There's also pressure to disable or hide this feature to reduce pesky customer support costs: "No one else reports any errors, why does your product report errors, I want a replacement," etc. Consumer DDR5 all has on-die ECC, out of desperate necessity, but it doesn't report anything, so you can't tell how close to the sun it's flying. Rowhammer just keeps coming back.
Across a fleet of 2000 servers, each with 64GB to 768GB of DDR3 or DDR4, most days we saw no errors detected at all, unless we happened to have a system with a DIMM that would throw a (correctable) error once a day or so. Reporting was always kind of weird: we'd get OS logging once an hour if there were any errors. That's mostly fine, except when a system goes from a couple of errors an hour to thousands per minute. Machine check exceptions are quite expensive to process and kill throughput if they're happening a lot, but we'd have no idea why the system was misbehaving until the next reporting interval. Of course, those thousands of errors really skew the average rate. We'd replace RAM for more than one uncorrectable error, for an uncorrectable after correctables, or, when we had time, for too many correctables (100+ per day). A lot of servers would show a couple of correctable errors once and then be fine, but some did become periodic or escalate.
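For what it's worth, that replacement rule of thumb can be written down as a small sketch. Everything here is hypothetical naming for illustration (`DimmHistory`, `replacement_priority`, the field names); it is not the tooling we actually ran, and it assumes you already have per-DIMM counts from somewhere.

```python
from dataclasses import dataclass


@dataclass
class DimmHistory:
    """Hypothetical per-DIMM error summary; field names are assumptions."""
    uncorrectable_total: int      # uncorrectable errors ever seen on this DIMM
    correctable_before_ue: int    # correctables logged before the latest uncorrectable
    correctable_last_day: int     # correctables in the most recent 24 hours


def replacement_priority(h: DimmHistory, ce_per_day_limit: int = 100) -> str:
    """Return 'replace', 'replace_when_convenient', or 'keep'."""
    # More than one uncorrectable error: replace.
    if h.uncorrectable_total > 1:
        return "replace"
    # An uncorrectable that followed earlier correctables: replace.
    if h.uncorrectable_total == 1 and h.correctable_before_ue > 0:
        return "replace"
    # Too many correctables (100+ per day): replace when there is time.
    if h.correctable_last_day >= ce_per_day_limit:
        return "replace_when_convenient"
    return "keep"
```

A DIMM that showed a couple of correctables once and then went quiet falls through to `"keep"`, which matches what we mostly saw.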
On consumer platforms, you should be able to test whether ECC reporting is happening by setting the memory voltage too low or the timings too fast so that you're likely to have errors. If you can trigger an uncorrectable error, you should be able to trigger a correctable one too.
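On Linux, one way to watch for this while you push the voltage or timings is to poll the kernel's EDAC counters in sysfs. This is a minimal sketch, assuming an EDAC driver for your memory controller is loaded and exposes `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mcN/`; the function name `read_edac_counts` is made up for the example.

```python
from pathlib import Path

# Standard Linux EDAC sysfs location; only populated if an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_ROOT) -> dict:
    """Map each memory controller name to (correctable, uncorrectable) counts."""
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC tree at all: nothing is being reported
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc.name] = (ce, ue)
    return counts
```

Call `read_edac_counts()` before and after a memory stress run with marginal settings. If the result is empty, or the correctable count never moves while the machine is visibly unstable, that suggests errors are not reaching the OS, though it doesn't tell you which layer is swallowing them.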
On-die ECC is better than nothing, I guess, but it's kind of like digital TV: it's good until it's not, with no indication you're close to the edge. Also, it's no help if there are problems between the CPU and the RAM.