
> I've looked into using some consumer AMD CPUs that theoretically work with ECC memory, and a couple motherboards from ASUS and ASRock theoretically support ECC, but I've heard that it's hard to figure out if it's really working.

It's not obvious, and there is a lot of misinformation on the Web. But it shouldn't be that difficult, at least on Linux (unlike BSD, and I'm speaking as a BSD user). Linux has the best ECC support for consumer AMD CPU users, thanks to the kernel driver amd64_edac [0][1]. It queries the registers inside the memory controller directly, so what it reports is a reliable indication of the ECC status. If dmesg says "EDAC amd64: Node 0: DRAM ECC enabled", you can be pretty sure that ECC is indeed enabled.
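Besides dmesg, the EDAC subsystem exposes per-controller counters in sysfs. Here is a minimal Python sketch for reading them, assuming the standard layout from the kernel RAS documentation [0] (`/sys/devices/system/edac/mc/mcN/` with `mc_name`, `ce_count` and `ue_count`). The function name `read_edac_status` is my own, and which files actually exist depends on your kernel and driver.

```python
from pathlib import Path

# Standard EDAC sysfs location; an assumption that holds on typical Linux setups.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def _read(path):
    """Return the stripped contents of a sysfs file, or None if it is absent."""
    return path.read_text().strip() if path.is_file() else None


def read_edac_status(root=EDAC_ROOT):
    """Return {controller: {"mc_name", "ce_count", "ue_count"}} for each mcN."""
    root = Path(root)
    status = {}
    if not root.is_dir():
        return status  # no EDAC driver loaded (or not Linux)
    for mc in sorted(root.glob("mc[0-9]*")):
        ce = _read(mc / "ce_count")  # corrected errors
        ue = _read(mc / "ue_count")  # uncorrected errors
        status[mc.name] = {
            "mc_name": _read(mc / "mc_name"),
            "ce_count": int(ce) if ce is not None else None,
            "ue_count": int(ue) if ue is not None else None,
        }
    return status


if __name__ == "__main__":
    found = read_edac_status()
    if not found:
        print("no EDAC memory controllers found")
    for name, info in found.items():
        print(name, info)
```

An empty result means no EDAC driver registered a memory controller, which is a hint (not proof) that ECC is not active.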

This driver also allows you to change the ECC memory scrubbing settings (but make sure the kernel is up-to-date, see [2][3]). Memory scrubbing, similar to RAID disk scrubbing, is a process of reading out the data and checking its integrity in the background (otherwise unused data with a recoverable error may never be checked, until an unrecoverable error occurs). Some consumer-grade motherboards don't show the memory scrubbing options in the firmware, and you often want to select a more aggressive scrub rate than the slow default, so the kernel driver is also really helpful.
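For reference, the scrub rate is exposed through the EDAC sysfs attribute `sdram_scrub_rate` (in bytes/sec) described in the kernel RAS documentation [0]. A rough Python sketch follows, with hypothetical helper names of my own. Writing needs root, the driver rounds to a rate the hardware supports, and on some kernel/CPU combinations the file is missing or reads -1, so treat this as an illustration rather than something guaranteed to work on your box.

```python
from pathlib import Path


def get_scrub_rate(mc_dir):
    """Return the current scrub rate in bytes/sec, or None if unsupported."""
    f = Path(mc_dir) / "sdram_scrub_rate"
    if not f.is_file():
        return None
    try:
        rate = int(f.read_text().strip())
    except ValueError:
        return None
    # The kernel docs say -1 means scrubbing is not implemented or setup failed.
    return rate if rate >= 0 else None


def set_scrub_rate(mc_dir, bytes_per_sec):
    """Request a minimum scrub bandwidth and return the rate read back."""
    if bytes_per_sec < 0:
        raise ValueError("scrub rate must be non-negative")
    f = Path(mc_dir) / "sdram_scrub_rate"
    f.write_text(f"{bytes_per_sec}\n")
    # Read back: the driver may have picked a different supported rate.
    return get_scrub_rate(mc_dir)
```

Typical use would be something like `set_scrub_rate("/sys/devices/system/edac/mc/mc0", 100_000_000)` as root, then checking the returned value to see what the hardware actually accepted.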

And speaking of testing whether actual errors can be reported and corrected... In a proper production environment, ECC is tested by a technique called "data poisoning", or "error injection". It allows the OS to inject ECC errors directly via the memory controller for confirmation. To do that, the feature must be enabled by the firmware, and the OS must also provide the necessary driver. Unfortunately, while server motherboards always have an enable/disable option, some consumer motherboards do not. And worse, it's not supported by Linux as far as I know. Theoretically one can read the AMD CPU datasheet, called the BIOS and Kernel Developer’s Guide, and write one's own tool, but unfortunately all post-Ryzen datasheets are under NDA.

But all is not lost. There is a proprietary tool, memtest86, which claims to support ECC injection [4]. This should be helpful (though I've never tried it personally). Alternatively, on consumer-grade hardware, one can simply check ECC by overclocking the memory and adjusting its timings to the edge of instability, then running a stress test like Prime95 (the Unix version is called mprime). In my experience, if the memory is sufficiently overclocked, a single test only takes 10 minutes.
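One way to tell whether the stress test produced corrected errors is to compare the EDAC counters before and after the run. A small Python sketch, assuming errors show up in the standard sysfs counters (`ce_count`/`ue_count` under `/sys/devices/system/edac/mc/mcN/`); the `snapshot` and `new_errors` helpers are names I made up for illustration.

```python
from pathlib import Path


def snapshot(root="/sys/devices/system/edac/mc"):
    """Return {(controller, counter): value} for every mcN that has counters."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        for name in ("ce_count", "ue_count"):
            f = mc / name
            if f.is_file():
                counts[(mc.name, name)] = int(f.read_text().strip())
    return counts


def new_errors(before, after):
    """Return only the counters that increased, mapped to their increase."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] > before.get(key, 0)
    }
```

Take a `snapshot()`, run mprime for a while, take another, and pass both to `new_errors`. A growing `ce_count` with `ue_count` unchanged suggests errors are being detected, corrected and reported.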

Finally, back to the misinformation on the Web. For example, one article showed that Linux kills a process via SIGBUS when an uncorrectable ECC error occurs, instead of triggering a kernel panic, and went on to conclude that ECC is not fully functional. That was pure misinformation: Linux only triggers a kernel panic when kernel memory has an uncorrectable error. For user memory, SIGBUS is the expected behavior. Another case of misinformation is due to the lack of proper error decoding on BSD. When an ECC error occurs, a Machine Check Exception is generated by the CPU. On Linux, it will be correctly decoded and recorded. But on FreeBSD, so far there's no decoder, leaving you with a mysterious MCA error in dmesg. For example, a correctable DRAM ECC error will be reported as "L3 cache error" (which led many people to falsely believe that Ryzen's ECC was not working on FreeBSD). I've compared the MCE/MCA error code for "L3 cache error" on FreeBSD with Linux's "correctable ECC" error code - they're identical.

> He eventually got the motherboard firmware vendor to admit that these were handled "firmware-first" meaning intentionally not reported.

This is the real problem. Most consumer motherboards don't do this, but it can be a headache when the firmware vendor screws it up... Some boards do it by default with an option to disable it, and on some it cannot be disabled. Also, many server motherboards with "firmware-first" ECC handling hide the errors from the OS, but still report them via the IPMI console.

> Consumer DDR5 is all ECC, out of desperate necessity, but it doesn't report anything, so you can't tell how close to the sun it's flying. Rowhammer just keeps coming back.

Saying "DDR5 is all ECC" is misleading. DDR5's "on-die ECC" should only be seen as an internal implementation detail to increase the chip yield, rather than a full form of data integrity protection. Real ECC is always performed by the memory controller. For DDR5, there still exist separate ECC versions for server applications, just like all previous DDR generations.

[0] https://www.kernel.org/doc/html/latest/admin-guide/ras.html

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

[2] https://unix.stackexchange.com/questions/593060/how-do-i-ena...

[3] https://lore.kernel.org/linux-edac/a9cdf7c2-868a-8f67-ac4e-c...

[4] https://www.memtest86.com/ecc.htm