by dis-sys 1340 days ago
interesting, so they are actually using Ryzen with ECC RAM (when most people would be using Ryzen with non-ECC RAM) and that saved them from some seriously corrupted data written back to their persistent storage.

wondering whether it is common for people to specifically monitor their system log for messages about correctable errors. do they consider the memory is faulty when there are correctable errors?


> do they consider the memory is faulty when there are correctable errors?

It depends on the frequency. Occasional CEs are somewhat expected (on a large enough scale) and one can live with them, after all that's what ECC is for. When CEs start happening frequently on one machine, most likely a DIMM is going bad and will worsen over time, so one should replace it.
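
To make "it depends on the frequency" concrete, here is a minimal sketch that reads the per-controller error counters the Linux EDAC subsystem exposes in sysfs. It assumes an EDAC driver (e.g. `amd64_edac`) is loaded and that the counters live under `/sys/devices/system/edac/mc`; the threshold of 10 is an arbitrary placeholder, not a recommendation from this thread.

```python
from pathlib import Path

# Assumed sysfs location of the EDAC memory-controller counters.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (ce_count, ue_count)} for every mcN directory."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


def needs_attention(counts, ce_threshold=10):
    """Controllers with any UE, or with at least ce_threshold CEs.

    ce_threshold is a hypothetical value; pick one that fits your fleet.
    """
    return sorted(
        name
        for name, (ce, ue) in counts.items()
        if ue > 0 or ce >= ce_threshold
    )
```

Calling `needs_attention(read_edac_counts())` from cron or a monitoring agent gives a list of controllers worth a closer look. The counters are cumulative, so to judge frequency you would compare successive readings rather than a single snapshot.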

thanks for the info. this is exactly what I am doing. it does provide extra peace of mind knowing that my odds of having silent corruption are further reduced by doing such monitoring.

Anyone who uses ECC DIMMs definitely MUST monitor what the memory controllers report to make optimal use of it.

However, you can also set a policy for what the Linux kernel should do on its own when an ECC error condition has been detected: The `edac_core` module has options such as `edac_mc_panic_on_ue`, which, if set, will trigger a kernel panic upon detecting an Uncorrectable Error in system memory. Depending on your use case, this can be better or worse than just logging it.
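
As a rough illustration of that policy knob, the sketch below reads the current value of an `edac_core` parameter and builds the matching `modprobe.d` line. It assumes the parameter is exposed under `/sys/module/edac_core/parameters`, which is where module parameters normally appear; the helper names are made up for this example.

```python
from pathlib import Path

# Assumed location of the edac_core module parameters in sysfs.
PARAM_DIR = Path("/sys/module/edac_core/parameters")


def read_edac_param(name, param_dir=PARAM_DIR):
    """Return the current value of an edac_core parameter, or None."""
    try:
        return (Path(param_dir) / name).read_text().strip()
    except OSError:
        return None  # module not loaded or parameter not present


def modprobe_option_line(panic_on_ue):
    """Build a modprobe.d 'options' line for the panic-on-UE policy."""
    value = 1 if panic_on_ue else 0
    return f"options edac_core edac_mc_panic_on_ue={value}"
```

`read_edac_param("edac_mc_panic_on_ue")` shows what the running kernel is set to, and the string from `modprobe_option_line(True)` could go into a file under `/etc/modprobe.d/` (the file name is up to you) so the setting survives a reboot.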

I do regularly look at dmesg on my Ryzen Threadripper system with ECC RAM.

Random correctable errors are rare but they do happen - at least if you overclock your RAM ("gaming" RAM often is already pre-overclocked). Might just be confirmation bias but I noticed ECC errors and then later heard there was a solar flare around the time.

I also once replaced a DIMM that was starting to get more frequent ECC errors. As OP found, working out the mapping on consumer boards requires some trial and error - my motherboard documentation even had a table but the numbering was different from the one used in Linux :/

I don't think I'm ever going to use a non-ECC desktop again, the additional cost is not that high for the extra safety against silent corruption.

I have a script to relay new dmesg events into my (xfce) desktop session using libnotify. I figure others may find it useful, too:

https://paste.debian.net/1257030/

It gets started via xdg autostart here, and will tell me about new "stuff" that happens. For it to work, your user will have to have permission to read the kernel event log/debug ringbuffer. I achieve that by setting the appropriate sysctl:

    kernel.dmesg_restrict = 0
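
For readers who cannot open the paste, here is a minimal sketch of the same idea. It is not the linked script: it assumes `dmesg --follow` and `notify-send` (from libnotify) are installed, and the keyword list is a guess at what memory-error lines usually contain.

```python
import subprocess

# Hypothetical keyword filter; adjust to what your kernel actually logs.
KEYWORDS = ("edac", "mce:", "hardware error")


def is_memory_event(line, keywords=KEYWORDS):
    """True if a kernel log line looks like an ECC/machine-check report."""
    lowered = line.lower()
    return any(keyword in lowered for keyword in keywords)


def relay(lines, notify):
    """Pass every matching line to notify(); return how many were sent."""
    sent = 0
    for line in lines:
        line = line.rstrip("\n")
        if is_memory_event(line):
            notify(line)
            sent += 1
    return sent


def notify_send(message):
    """Show a desktop notification via the notify-send CLI."""
    subprocess.run(["notify-send", "Kernel log", message], check=False)


def follow_dmesg():
    """Relay matching kernel messages until dmesg exits."""
    proc = subprocess.Popen(
        ["dmesg", "--follow"], stdout=subprocess.PIPE, text=True
    )
    return relay(proc.stdout, notify_send)
```

Calling `follow_dmesg()` from an autostart entry would give roughly the behaviour described above. Note that `dmesg --follow` first prints the existing buffer, so old matching lines are replayed as notifications when the script starts.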

I just keep `dmesg -w` running in a terminal window :)

> I don't think I'm ever going to use a non-ECC desktop again, the additional cost is not that high for the extra safety against silent corruption.

same here, but sadly you don't get to choose when purchasing a laptop: it is simply impossible to get ECC RAM if you use a MacBook Pro.

Even in PC-land, I don't think there's much choice when it comes to ECC in laptops.

I can only remember one model of Lenovo that had an option to have a Xeon CPU with ECC RAM. I've never seen one with an AMD CPU.

There are many Dell, HP and Lenovo laptops with ECC memory, but all are very expensive (e.g. $2500 to $7500 in a usable configuration; prices may start a little under $2000, but only for a useless configuration).

When browsing their Web sites, these models are not obvious, because they are in the section for "enterprise" laptops, listed under "mobile workstations".

For me as a nerd, yes.

ZFS-based NAS with ECC, SMART checks for the HDDs, and system checks that include ECC too.

> (when most people would be using Ryzen with non-ECC RAM)

Is this true for servers? If I had a Ryzen based server, I’d use ECC RAM.

I think it mainly applies to non-server systems, where a) most people don't even know about ECC and b) non-server Ryzens can only use UDIMMs, and there are not that many ECC UDIMMs available (probably just because of low demand). So you probably need to make some tradeoffs, like paying more (more than a 15% markup for the 9th bit), and the fastest chips at the high end won't be available as ECC.

I think it is also not required for consumer Ryzen mainboards to support ECC but at least for the high end ones many do.

Because only some Ryzen motherboards support ECC, one must always carefully read the technical specifications before buying a motherboard.

There are many ASUS and ASRock AM5 (and AM4) motherboards that support ECC, and for those it is typically written in the memory section "supports ECC & Non-ECC unbuffered DIMMs".

When nothing like this is written, then ECC is not supported.

Moreover, all the motherboards with ECC support must have in the "Advanced" BIOS Setup an option for enabling ECC, which must be used, because the default is always to disable ECC.

With the Ryzen 7000 series there is an improvement over the previous Ryzen series, because their specification clearly states that ECC is supported. Previously, the ECC support was not explicit, even if, unlike Intel, they did not disable ECC, so you could hope that it works fine.

Now Intel no longer disables ECC in many Raptor Lake and Alder Lake desktop CPUs, but the motherboards with ECC support for Intel are much harder to find (because they must use a special workstation chipset, while for AMD it is enough to add the PCB traces for the ECC bits).

> Moreover, all the motherboards with ECC support must have in the "Advanced" BIOS Setup an option for enabling ECC, which must be used, because the default is always to disable ECC.

On both of my Ryzen ASUS motherboards (WS X570-ACE, and ROG STRIX X399-E GAMING) this is not true. I just slapped the DIMMs in there and powered the box up.

dmidecode thinks that the system has ECC enabled:

    dmidecode --type memory | grep -e "Error Correction"
    Error Correction Type: Multi-bit ECC

'amd64_edac' doesn't complain about being loaded on a non-ECC system.

The closed-source version of memtest86 reports that it's running on an ECC-enabled system.
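
The `dmidecode` check above can be scripted. This is a small sketch that pulls the "Error Correction Type" fields out of `dmidecode --type memory` output; the function names are made up here, and the DMI tables only reflect what the firmware claims, so it complements the EDAC and memtest86 checks and does not replace them.

```python
import subprocess


def error_correction_types(dmidecode_output):
    """Return every 'Error Correction Type' value found in the output."""
    types = []
    for line in dmidecode_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Error Correction Type":
            types.append(value.strip())
    return types


def reports_ecc(dmidecode_output):
    """True if at least one memory array is reported with an ECC type."""
    return any("ECC" in t for t in error_correction_types(dmidecode_output))


def run_dmidecode():
    """Run dmidecode for the memory tables (usually requires root)."""
    result = subprocess.run(
        ["dmidecode", "--type", "memory"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

With root privileges, `reports_ecc(run_dmidecode())` returns `True` on a system that prints the "Multi-bit ECC" line shown above.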

This may depend on the BIOS version, even on the same motherboard.

I also have the same ASUS Pro WS X570-ACE (bought in Q4 2019), which I use with ECC DIMMs, and I had to enable ECC support in the BIOS.

In any case, one should always check for such an option in the BIOS, to avoid surprises.