Hacker News
by ploxiln 1317 days ago
It can be very hard to get memory error reporting these days.

Bryan Cantrill mentions in one of his talks that Joyent had a datacenter where uncorrectable errors were sporadically halting servers, but no correctable errors were ever counted. He eventually got the motherboard firmware vendor to admit that these were handled "firmware-first", meaning they were intentionally not reported.

I've looked into using some consumer AMD CPUs that theoretically work with ECC memory, and a couple motherboards from ASUS and ASRock theoretically support ECC, but I've heard that it's hard to figure out if it's really working.

Testing whether a motherboard firmware actually reports ECC errors ... probably doesn't really happen, because it seems to work fine if it doesn't report them, and the company wants to just finish QA and ship. And the rare motherboard that does report errors correctly is more likely to trigger bugs in higher layers that were never actually tested before. And there's pressure to disable or hide this feature to reduce pesky customer support costs. No one else reports any errors, why does your product report errors, I want a replacement, etc.

Consumer DDR5 is all ECC, out of desperate necessity, but it doesn't report anything, so you can't tell how close to the sun it's flying. Rowhammer just keeps coming back.

2 comments

I've certainly seen ECC error reporting work, although it was a little sketchy, but that was on Xeon 2600 v1-4, which is dated now and a server platform anyway.

With a fleet of 2000 servers with 64GB to 768GB each of DDR3 and DDR4, most days we didn't see any errors detected unless we currently had a system with a DIMM that would throw a (correctable) error once a day or so. Reporting was always kind of weird: we'd get OS logging once an hour if there were any errors, which is mostly fine, except when a system goes from a couple of errors an hour to thousands per minute. Machine check exceptions are quite expensive to process and kill throughput if they're happening a lot, but we'd have no idea why the system was misbehaving until the next reporting interval. Of course, those thousands of errors really skew the average rate. We'd replace RAM for more than one uncorrectable, or an uncorrectable after correctables, or, when we had time, too many correctables (100+ per day). A lot of servers would show a couple of correctable errors once and then be fine, but some did become periodic or escalate.
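A rough sketch of that replacement policy in Python, in case it helps to see the rules side by side. The `DimmErrors` record and its field names are made up for illustration (they aren't any real tool's schema), and the only number taken from the comment is the 100+ per day threshold. It also simplifies "uncorrectable after correctables" to "at least one of each", since it doesn't track ordering.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DimmErrors:
    """Error counts for one DIMM (hypothetical record for illustration)."""
    uncorrectable: int         # uncorrectable errors seen so far
    correctable_total: int     # correctable errors seen so far
    correctable_last_day: int  # correctable errors in the last 24 hours


# "100+ per day" comes from the comment above; treat it as a starting point.
CORRECTABLE_PER_DAY_LIMIT = 100


def replacement_reason(e: DimmErrors) -> Optional[str]:
    """Return why the DIMM should be replaced, or None to keep watching it."""
    if e.uncorrectable > 1:
        return "more than one uncorrectable error"
    # Simplification: ordering is not tracked, only that both kinds occurred.
    if e.uncorrectable >= 1 and e.correctable_total > 0:
        return "uncorrectable error after correctable errors"
    if e.correctable_last_day >= CORRECTABLE_PER_DAY_LIMIT:
        return "too many correctable errors per day"
    return None
```

A DIMM that showed a couple of correctable errors once and then went quiet returns `None` here, which matches the "show a couple and then be fine" case.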

On consumer platforms, you should be able to test if ECC reporting is happening by setting the memory voltage too low or the timings too fast so that you're likely to have errors. If you can trigger an uncorrectable error, you should be able to trigger a correctable too.

On-die ECC is better than nothing, I guess, but it's kind of like digital TV --- it's good until it's not, with no indication you're close to the edge. Also, it's no help if there are problems between the CPU and the RAM.

The “ECC” on DDR5 does _not_ replace regular ECC. Please see Ian’s explanation: https://youtu.be/XGwcPzBJCh0

> I've looked into using some consumer AMD CPUs that theoretically work with ECC memory, and a couple motherboards from ASUS and ASRock theoretically support ECC, but I've heard that it's hard to figure out if it's really working.

It's not obvious, and there is a lot of misinformation on the Web for sure. But it shouldn't be that difficult, at least on Linux (unlike BSD, and I'm speaking as a BSD user). Linux has the best ECC support for consumer AMD CPU users, thanks to the kernel driver amd64_edac [0][1]. It queries the registers inside the memory controller, which gives a reliable indication of the ECC status. If dmesg says "EDAC amd64: Node 0: DRAM ECC enabled", you can be pretty sure that ECC is indeed enabled.
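If you want to check this from a script rather than by eye, something like the following Python sketch should do. It only looks for the message quoted above; the "disabled" wording is my assumption about how the driver phrases the opposite case, so verify it against your own dmesg output.

```python
import re
from typing import Dict

# Matches lines like "EDAC amd64: Node 0: DRAM ECC enabled".
# The "disabled" alternative is an assumption, not taken from the comment.
_ECC_LINE = re.compile(r"EDAC amd64: Node (\d+): DRAM ECC (enabled|disabled)")


def ecc_status_by_node(dmesg_text: str) -> Dict[int, bool]:
    """Map each node number found in the log text to True if ECC is enabled."""
    status = {}
    for match in _ECC_LINE.finditer(dmesg_text):
        status[int(match.group(1))] = match.group(2) == "enabled"
    return status


# Typical use (reading the kernel log usually needs root):
#   import subprocess
#   log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
#   print(ecc_status_by_node(log))
```

An empty result means the message wasn't found at all, which is not the same as ECC being off: the driver may not be loaded, or the log may have rotated.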

This driver also allows you to change the ECC memory scrubbing settings (but make sure the kernel is up-to-date, see [2][3]). Memory scrubbing, similar to RAID disk scrubbing, is a process of reading out the data and checking its integrity in the background (otherwise unused data with a recoverable error may never be checked, until an unrecoverable error occurs). Some consumer-grade motherboards don't show the memory scrubbing options in the firmware, and you often want to select a more aggressive scrub rate than the slow default, so the kernel driver is also really helpful.
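For reference, the EDAC drivers expose their per-controller counters and the scrub rate as files under sysfs, which is what [0] documents. Here is a minimal Python sketch that reads them. The paths and file names (`ce_count`, `ue_count`, `sdram_scrub_rate`) are the ones I'd expect from the EDAC documentation, but treat them as assumptions and check what your kernel actually provides; the `root` parameter is only there so the function can be pointed at a test directory.

```python
from pathlib import Path
from typing import Dict, Optional

# Default EDAC location; assumed from the kernel's RAS/EDAC documentation.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_int(path: Path) -> Optional[int]:
    """Read an integer from a sysfs file, or None if missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def edac_snapshot(root=EDAC_MC_ROOT) -> Dict[str, Dict[str, Optional[int]]]:
    """Collect error counters and the scrub rate for each memory controller."""
    snapshot = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        snapshot[mc.name] = {
            "ce_count": read_int(mc / "ce_count"),  # correctable errors
            "ue_count": read_int(mc / "ue_count"),  # uncorrectable errors
            "sdram_scrub_rate": read_int(mc / "sdram_scrub_rate"),
        }
    return snapshot
```

Changing the scrub rate is done by writing a new value to `sdram_scrub_rate` as root; I've left that out because whether the write takes effect depends on the kernel version, as [2][3] discuss.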

And speaking of testing whether actual errors can be reported and corrected... In a proper production environment, ECC is tested by a technique called "data poisoning", or "error injection". It allows the OS to inject ECC errors directly via the memory controller for confirmation. To do that, this feature must be enabled by firmware, and the OS must also provide the necessary driver. Unfortunately, while server motherboards always have an enable/disable option, some consumer motherboards do not. And worse, it's not supported by Linux as far as I know. Theoretically one could read the AMD CPU datasheet, called the BIOS and Kernel Developer’s Guide, and write one's own tool, but unfortunately all datasheets post-Ryzen are under NDA.

But all is not lost. There is a proprietary tool, memtest86, which claims to support ECC injection [4]. This should be helpful (though I've never tried it personally). Alternatively, on consumer-grade hardware, one can simply check ECC by overclocking the memory and adjusting its timings to the edge of instability, then running a stress test like Prime95 (the Unix version is called mprime). In my experience, if the memory is sufficiently overclocked, a single test only takes 10 minutes.
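To make the overclock-and-stress check less of an eyeball exercise, you can snapshot the EDAC counters before and after the stress run and look at the difference. A minimal Python sketch, assuming the counters live in `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mc*` (check your kernel). The stress command is whatever you choose to run; nothing here is specific to mprime.

```python
import subprocess
from pathlib import Path
from typing import Dict, Optional, Sequence


def read_counts(root="/sys/devices/system/edac/mc") -> Dict[str, int]:
    """Sum correctable/uncorrectable counts over all memory controllers."""
    totals = {"ce_count": 0, "ue_count": 0}
    for mc in Path(root).glob("mc[0-9]*"):
        for name in totals:
            try:
                totals[name] += int((mc / name).read_text())
            except (OSError, ValueError):
                pass  # counter missing or unreadable; skip it
    return totals


def errors_during(command: Sequence[str],
                  root="/sys/devices/system/edac/mc",
                  timeout: Optional[float] = None) -> Dict[str, int]:
    """Run a stress command and return how much each counter grew."""
    before = read_counts(root)
    try:
        subprocess.run(command, check=False, timeout=timeout)
    except subprocess.TimeoutExpired:
        pass  # stress tools often run forever; the timeout just stops them
    after = read_counts(root)
    return {name: after[name] - before[name] for name in before}
```

If `ce_count` grows while the stress test keeps passing, errors are being corrected and reported. If the test fails or the machine crashes with both counters at zero, reporting is probably not working. The counters may lag behind the actual errors, so it's worth reading them again a little later.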

Finally, there is a lot of misinformation on the Web. For example, one article showed that Linux kills a process via SIGBUS when an uncorrectable ECC error occurs, instead of triggering a kernel panic, and went on to conclude that ECC is not fully functional. That was pure misinformation: Linux only triggers a kernel panic when kernel memory has an uncorrectable error; for user memory, SIGBUS is the expected behavior. Another case of misinformation is due to the lack of proper error decoding on BSD. When an ECC error occurs, a Machine Check Exception is generated by the CPU. On Linux, it will be correctly decoded and recorded. But on FreeBSD, so far there's no decoder, leaving you with a mysterious MCA error in dmesg. For example, a correctable DRAM ECC error will be reported as "L3 cache error" (which led many people to falsely believe that Ryzen's ECC was not working on FreeBSD). I've compared the MCE/MCA error code for "L3 cache error" on FreeBSD with Linux's "correctable ECC" error code - they're identical.

> He eventually got the motherboard firmware vendor to admit that these were handled "firmware-first" meaning intentionally not reported.

This is the real problem. Most consumer motherboards don't do this, but it can be a headache when the firmware vendor screwed it up... Some do it by default with an option to disable it, and on some it cannot be disabled. Also, many server motherboards with "firmware-first" ECC handling hide the errors from the OS, but still report ECC errors via the IPMI console.

> Consumer DDR5 is all ECC, out of desperate necessity, but it doesn't report anything, so you can't tell how close to the sun it's flying. Rowhammer just keeps coming back.

Saying "DDR5 is all ECC" is misleading. DDR5's "on-die ECC" should only be seen as an internal implementation detail to increase the chip yield, rather than a full form of data integrity protection. Real ECC is always performed by the memory controller. For DDR5, there still exist separate ECC versions for server applications, just like all previous DDR generations.

[0] https://www.kernel.org/doc/html/latest/admin-guide/ras.html

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

[2] https://unix.stackexchange.com/questions/593060/how-do-i-ena...

[3] https://lore.kernel.org/linux-edac/a9cdf7c2-868a-8f67-ac4e-c...

[4] https://www.memtest86.com/ecc.htm