by MichaelZuo 1488 days ago
That's not the type of ECC the parent was talking about. DDR5 has on-die ECC because its densities and clock rates are so high that it needs ECC to function properly, but like most standards, the minimal implementation is quite watered down. It doesn't correct the full range of bit flips that a server with ECC RAM does.

Disagree. Parent was discussing the need to reboot after a system has been up for many hours. The failure mode, assuming it's related to the DRAM, would be an accumulation of bit flips in the DRAM. Every memory has some FIT/Mbit rate. The on-die ECC added in the DDR5 spec will be highly effective in addressing this failure mode.
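
To put rough numbers on the accumulation argument: FIT is failures per 10^9 device-hours, so the expected number of bit flips scales linearly with capacity and uptime. A minimal sketch, where the function name and the example figures are made up for illustration and are not measured rates for any real part:

```python
# FIT = failures in time, i.e. failures per 1e9 device-hours.
HOURS_PER_FIT_UNIT = 1e9


def expected_bit_flips(fit_per_mbit, capacity_gib, uptime_hours):
    """Expected bit flips for a given FIT/Mbit rate, capacity, and uptime.

    All inputs are hypothetical; real FIT rates vary by part and environment.
    """
    # GiB -> binary megabits: 1024 MiB per GiB, 8 bits per byte.
    capacity_mbit = capacity_gib * 1024 * 8
    return fit_per_mbit * capacity_mbit * uptime_hours / HOURS_PER_FIT_UNIT


# Illustrative only: 16 GiB at an assumed 1 FIT/Mbit, up for 1000 hours.
flips = expected_bit_flips(fit_per_mbit=1.0, capacity_gib=16, uptime_hours=1000)
print(flips)
```

Doubling either the uptime or the capacity doubles the expected count, which is why long-running machines are where this shows up.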

Channel ECC is the ECC type most directly relevant to high clock rates and signal integrity. I agree with you that channel ECC becomes a practical requirement to meet the interface transaction rates of DDR5. It is also true that channel ECC is not mandatory in DDR5 and, as with previous DDR generations, is not implemented by mainstream CPU platforms.

If the on-die ECC reduces the error rate but the lack of standard channel ECC increases it, because of the much more demanding signals, then it's not at all clear that the overall error rate will be lower.

In fact it could very well be higher depending on how the physical module is designed.
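
A toy model of that trade-off, assuming the two error sources are independent and simply add. The function name and every number here are hypothetical, chosen only to show that the net rate can go either way:

```python
def net_error_rate(dram_rate, on_die_coverage, channel_rate):
    """Combined error rate under a simple additive model (an assumption).

    dram_rate:       raw DRAM cell error rate, arbitrary units
    on_die_coverage: fraction of DRAM errors corrected on-die (0.0 to 1.0)
    channel_rate:    errors from the module and interconnect, same units
    """
    return dram_rate * (1.0 - on_die_coverage) + channel_rate


# Hypothetical older module: no on-die ECC, modest channel errors.
older = net_error_rate(dram_rate=10.0, on_die_coverage=0.0, channel_rate=1.0)

# Hypothetical DDR5 module: on-die ECC halves DRAM errors,
# but faster signalling raises channel errors.
ddr5 = net_error_rate(dram_rate=10.0, on_die_coverage=0.5, channel_rate=8.0)

print(older, ddr5)  # the DDR5 figure is higher with these made-up inputs
```

With a lower assumed channel rate the comparison flips, so the result depends entirely on inputs nobody in this thread has data for.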

I imagine some portion of bit-flip induced reboots are due to the actual DRAM chips, but also some portion will be due to everything else that can bit flip both on the memory module itself and in the interconnect.

I haven't seen anything yet to say that DRAM chip bit flips will be in the majority.