Hacker News
by eloy 1995 days ago
He does explain it:

> We have decades of odd random kernel oopses that could never be explained and were likely due to bad memory. And if it causes a kernel oops, I can guarantee that there are several orders of magnitude more cases where it just caused a bit-flip that just never ended up being so critical.

It might be false, but I think it's a reasonable assumption.

To someone on HN who isn’t familiar with what ECC does, that explains nothing about how ECC works, how it could have prevented these situations, or how often they really happen.
The problem is that, if you don't have ECC to detect the errors, it's very hard to know what exactly caused a random, non-reproducible crash. Especially in kernel mode where there's little memory protection and basically any driver could be writing anywhere at any time.

I can understand Linus's frustration from that point of view: without ECC RAM, when you get some super weird crash report where some pointer got corrupted for no apparent reason, you can't be sure if it was just a random bitflip or if it's actually hiding a bigger problem.

Couldn't you run memtest on a PC without ECC for a couple of days to estimate the error rate?
Pretty sure most memory test tools like memtest86 write the memory and then read it back shortly thereafter in relatively small blocks. This makes the window for errors to be introduced dramatically smaller. Most memory in a computer is not being continually rewritten under normal use.
If you manage to replicate bitflips every few days your RAM is broken.

It's the "once every other year" type of bitflip that's the problem. The proverbial "cosmic ray" hitting your DRAM and flipping a bit. That will be caught by ECC but it'll most likely remain a total mystery if it causes your non-ECC hardware to crash.
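For anyone wondering how a flipped bit can be "caught" at all, here is a toy sketch using a Hamming(7,4) code. This is only an illustration of the idea: real ECC DIMMs typically protect 64 data bits with 8 check bits (SECDED) and do it in the memory controller hardware, and the function names below are made up for this example.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Codeword positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Return (data_bits, error_position); error_position is 0 if clean."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the 1-based position
    # of a single flipped bit (0 means no error was detected).
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], error_position


word = [1, 0, 1, 1]
stored = encode(word)
stored[4] ^= 1  # simulate a "cosmic ray" flipping the bit at position 5
recovered, position = decode(stored)
print(recovered == word, position)
```

The point is that the check bits not only reveal that something flipped but also where, so the error gets corrected and (on real hardware) logged instead of silently corrupting a pointer.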

It isn't only cosmic rays. Regular old radiation can also cause it. I've read about a server that had many repeated problems and the techs replaced the entire motherboard at one point.

Then one of them brought in his personal Geiger counter and found the radiation coming off the steel in that rack case was significantly higher than background.

You may never know when the metal you use was recycled from something used to hold radioactive materials.

In the meantime I read here that the rate is around 1 bit flip per GB per month. So in a 120 GB system that should be about 1 flip every 6 hours. But then this number might be wrong.
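The arithmetic above can be checked with a quick sketch. It assumes a 30-day month and that flips scale linearly with the amount of RAM; the function name is just for illustration.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def hours_between_flips(ram_gb, flips_per_gb_per_month=1.0):
    """Average hours between bit flips for a machine with ram_gb of RAM."""
    flips_per_month = ram_gb * flips_per_gb_per_month
    return HOURS_PER_MONTH / flips_per_month


print(hours_between_flips(120))  # 6.0
```

So the 6 hour figure follows from the quoted rate; whether the rate itself is right is a separate question.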
From https://en.m.wikipedia.org/wiki/ECC_memory -

> A large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance ’09 conference.[6] The actual error rate found was several orders of magnitude higher than the previous small-scale or laboratory studies, with between 25,000 (2.5 × 10^−11 error/bit·h) and 70,000 (7.0 × 10^−11 error/bit·h, or 1 bit error per gigabyte of RAM per 1.8 hours) errors per billion device hours per megabit. More than 8% of DIMM memory modules were affected by errors per year.
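The unit conversions in that quote are easy to get lost in, so here is a rough sketch of them. It assumes decimal units (1 megabit = 10^6 bits, 1 gigabyte = 8 × 10^9 bits); the function names are made up for this example.

```python
BITS_PER_MEGABIT = 1e6   # assumption: decimal megabit
BITS_PER_GIGABYTE = 8e9  # assumption: decimal gigabyte


def errors_per_bit_hour(errors_per_billion_hours_per_mbit):
    """Convert 'errors per billion device hours per megabit' to errors/bit·h."""
    return errors_per_billion_hours_per_mbit / (1e9 * BITS_PER_MEGABIT)


def hours_per_error_per_gb(errors_per_billion_hours_per_mbit):
    """Average hours between errors for one gigabyte of RAM."""
    rate_per_gb = errors_per_bit_hour(errors_per_billion_hours_per_mbit) * BITS_PER_GIGABYTE
    return 1.0 / rate_per_gb


for rate in (25_000, 70_000):
    print(f"{rate}: {errors_per_bit_hour(rate):.1e} error/bit·h, "
          f"one error per GB every {hours_per_error_per_gb(rate):.1f} h")
```

With these assumptions the two ends of the range work out to roughly one error per gigabyte every 5 hours and every 1.8 hours, which matches the figures in the quote.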

It takes 5 seconds to Google ECC memory if you're really interested, and if you're working on kernel-related stuff you 99.9999% know what it is.
To someone on HN who isn’t familiar with what Google does that explains nothing about how Google works ;)
Google is like an evil version of Duck Duck Go.
Nah, to Google is just a generic verb. For example I too do all my googling at Duck Duck Go.

Hi Alphabet lawyers.

I believe there was a suit against Alphabet about this very thing.

They argued that 'Google' has now become a verb meaning 'to search the Internet for' and as such Alphabet should have the name taken away.

Right. My point is that TFA serves zero purpose to most people on here. Those that know how ECC works already know that it is a must-have. Those that don't will learn very little from the post, because it fails to explain what ECC is and why you need it aside from general statements about memory errors. It will reaffirm for those that know what ECC RAM is that it's a good idea, but they already know it anyways. It reads a lot like an article about why vitamin C is a good thing.