	by IgorPartola 1999 days ago
	To someone on HN who isn’t familiar with what ECC does, that explains nothing about how ECC works, how it could have prevented these situations, or how often they really happen.

simias 1999 days ago

The problem is that, if you don't have ECC to detect the errors, it's very hard to know what exactly caused a random, non-reproducible crash. Especially in kernel mode where there's little memory protection and basically any driver could be writing anywhere at any time.

I can understand Linus's frustration from that point of view: without ECC RAM, when you get some super weird crash report where some pointer got corrupted for no apparent reason, you can't be sure if it was just a random bitflip or if it's actually hiding a bigger problem.

andi999 1999 days ago

You could run memtest on a PC without ECC for a couple of days to estimate the error rate, or not?

fuster 1999 days ago

Pretty sure most memory test tools like memtest86 write the memory and then read it back shortly thereafter in relatively small blocks. This makes the window for errors to be introduced dramatically smaller. Most memory in a computer is not being continually rewritten under normal use.

simias 1998 days ago

If you manage to replicate bitflips every few days, your RAM is broken.

It's the "once every other year" type of bitflip that's the problem. The proverbial "cosmic ray" hitting your DRAM and flipping a bit. That will be caught by ECC but it'll most likely remain a total mystery if it causes your non-ECC hardware to crash.
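
For anyone wondering how that catching works, here is a minimal sketch using a toy Hamming(7,4) code. This is an illustration only, not what any particular memory controller does: real ECC DIMMs typically use a wider single-error-correct, double-error-detect code (8 check bits per 64 data bits). The function names are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (toy example)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected codeword, 1-based position of the flipped bit, or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the position of the bad bit.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bad bit back
    return c, syndrome


def hamming74_data(codeword):
    """Extract the 4 data bits from a codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
hit = list(stored)
hit[4] ^= 1  # simulate a single bitflip at position 5
fixed, position = hamming74_correct(hit)
print(position, fixed == stored)  # prints: 5 True
```

Without the parity bits, the flipped word would just be read back as different data and nobody would know. With them, the flip is both located and undone, and the hardware can log it.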

zlynx 1998 days ago

It isn't only cosmic rays. Regular old radiation can also cause it. I've read about a server that had many repeated problems and the techs replaced the entire motherboard at one point.

Then one of them brought in his personal Geiger counter and found the radiation coming off the steel in that rack case was significantly higher than background.

You may never know when the metal you use was recycled from something used to hold radioactive materials.

andi999 1998 days ago

In the meantime I read here that the rate is around 1 bit flip per GB per month. So in a 120 GB system that should be about 1 flip every 6 hours. But then this number might be wrong.
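
The arithmetic can be checked with a few lines. This is only a sketch: it takes the quoted 1 flip per GB per month at face value and assumes a 30-day month; the function name is made up for the example.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def hours_between_flips(flips_per_gb_per_month, ram_gb):
    """Average hours between bit flips for a machine with ram_gb of RAM."""
    flips_per_month = flips_per_gb_per_month * ram_gb
    return HOURS_PER_MONTH / flips_per_month


print(hours_between_flips(1, 120))  # prints: 6.0
```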

chalst 1999 days ago

From https://en.m.wikipedia.org/wiki/ECC_memory -

> A large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance ’09 conference.[6] The actual error rate found was several orders of magnitude higher than the previous small-scale or laboratory studies, with between 25,000 (2.5 × 10^-11 error/bit·h) and 70,000 (7.0 × 10^-11 error/bit·h, or 1 bit error per gigabyte of RAM per 1.8 hours) errors per billion device hours per megabit. More than 8% of DIMM memory modules were affected by errors per year.
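
The unit conversions in that quote can be reproduced as follows. This is a rough sketch that assumes decimal units (1 megabit = 10^6 bits, 1 gigabyte = 8 × 10^9 bits); the function names are made up for the example.

```python
BITS_PER_MEGABIT = 1e6   # assumption: decimal megabit
BITS_PER_GIGABYTE = 8e9  # assumption: decimal gigabyte


def errors_per_bit_hour(errors_per_billion_device_hours_per_mbit):
    """Convert 'errors per billion device hours per megabit' to errors per bit-hour."""
    return errors_per_billion_device_hours_per_mbit / 1e9 / BITS_PER_MEGABIT


def hours_per_error_per_gb(rate_per_bit_hour):
    """Average hours between errors in one gigabyte of RAM at the given rate."""
    return 1 / (rate_per_bit_hour * BITS_PER_GIGABYTE)


low = errors_per_bit_hour(25_000)   # about 2.5e-11 errors per bit-hour
high = errors_per_bit_hour(70_000)  # about 7.0e-11 errors per bit-hour
print(low, high)
print(round(hours_per_error_per_gb(high), 1))  # roughly 1.8 hours per error per GB
```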

reader_mode 1999 days ago

It takes 5 seconds to Google ECC memory if you're really interested, and if you're working on kernel-related stuff you 99.9999% know what it is.

nix23 1999 days ago

To someone on HN who isn’t familiar with what Google does that explains nothing about how Google works ;)

TheCoelacanth 1999 days ago

Google is like an evil version of Duck Duck Go.

Danieru 1999 days ago

Nah, to Google is just a generic verb. For example I too do all my googling at Duck Duck Go.

Hi, Alphabet lawyers.

vorticalbox 1999 days ago

I believe there was a suit against Alphabet about this very thing.

The plaintiffs argued that 'Google' has now become a verb meaning 'to search the Internet for' and that, as such, Alphabet should have the name taken away.

IgorPartola 1999 days ago

Right. My point is that TFA serves zero purpose to most people on here. Those that know how ECC works already know that it is a must-have. Those that don't will learn very little from the post, because it fails to explain what ECC is and why you need it, aside from general statements about memory errors. It will reaffirm for those who know what ECC RAM is that it's a good idea, but they already know that anyway. It reads a lot like an article about why vitamin C is a good thing.
