by moremetadata 1202 days ago
Moral of the story?

Upgrade to DDR5 RAM, the latest standard, which has on-die ECC but is not as good at catching bit flips as proper ECC memory with a separate extra chip for error-correction data.

https://en.wikipedia.org/wiki/DDR5_SDRAM#:~:text=Unlike%20DD....

Whilst proper ECC RAM chips and motherboards exist, I'm surprised that a cheaper solution that is just as good as proper ECC doesn't exist, although I know some would argue that DDR5 is one step in the right direction in a marathon.

I guess the markets know best and chase the numbers, assuming they are also using proper ECC memory and binary-coded decimal rather than floating-point arithmetic, which introduces errors. Isn't that something central banks have been using for decades?

https://en.wikipedia.org/wiki/Floating-point_error_mitigatio...
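
To make the floating-point point concrete, here is a minimal sketch. It assumes Python's standard `decimal` module as a stand-in for binary-coded decimal: it does decimal arithmetic, but it is not literally a BCD encoding, and it is not what any particular bank runs.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 exactly, so repeated sums drift.
float_total = sum(0.1 for _ in range(10))

# Decimal arithmetic stores base-10 digits, so the same sum stays exact.
decimal_total = sum(Decimal("0.1") for _ in range(10))

print(float_total == 1.0)               # False
print(decimal_total == Decimal("1.0"))  # True
```

The drift is tiny per operation, but it accumulates, which is why monetary code tends to avoid binary floats.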

3 comments

Also from your link:

“There still exist non-ECC and ECC DDR5 DIMM variants; the ECC variants have extra data lines to the CPU to send error-detection data, letting the CPU detect and correct errors that occurred in transit.”

Intel and ASRock released a NUC with in-band ECC, which is just as good at protecting your data but comes with a performance hit.

https://www.anandtech.com/show/18732/asrock-industrial-nucs-...
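
For anyone wondering how "detect and correct" works with a few extra bits, here is a toy sketch using a Hamming(7,4) code. This is an illustration only: real ECC DIMMs and DDR5 on-die ECC use much wider codes, and the function names below are made up for the example.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, syndrome); a nonzero syndrome is the 1-based flipped position."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], syndrome


data = [1, 0, 1, 1]
stored = hamming74_encode(data)

corrupted = list(stored)
corrupted[4] ^= 1  # simulate a single bit flip at position 5

recovered, position = hamming74_decode(corrupted)
print(recovered == data, position)
```

A code like this fixes any single flipped bit per codeword. Two flips in the same codeword would be miscorrected, which is why memory ECC usually adds an extra overall parity bit to at least detect double errors.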

DDR5 has enough on-chip ECC to make errors effectively impossible. It doesn't provide error data to the CPU, though, so errors in transit can still occur. This is really unlikely, though, and anything not mission-critical will no longer need the extra ECC computation on the CPU side. (DDR5 encapsulates the memory controller.)

> This is really unlikely, though

It happens quite often, as a result of dust in the contacts when the memory was installed, weak solder on the chips or sockets, bad capacitors, etc.

None of which is that likely on machines in good working order, but many are not. And you can go from one to the other at any time as a result of a power spike or a cooling failure.

Source on that? Has anyone tested that?

> This is really unlikely, though, and anything not mission-critical will no longer need the extra ECC computation on the CPU-side.

ECC computation is done in hardware anyway.

I meant the memory controller on the CPU side won't need to implement it. Obviously, full DDR5 ECC hardware exists, but the on-chip ECC as a whole makes bit flips far less likely than with DDR4. There's not much of a need for the complete set on consumer hardware.

Of course this is assuming random cosmic ray bit flips, not faulty hardware. And it's speaking cost-wise from the manufacturer's perspective. I'd personally like full ECC to just be the standard.

> This is really unlikely, though,

I think you can say that because people are not routinely monitoring their surroundings for ionizing radiation.

If this were to change, I think we could start to identify some of the military locations that may be interfering with equipment, which would then expose the weakness of DDR5.