To someone on HN who isn’t familiar with what ECC does, that explains nothing about how ECC works, how it could have prevented these situations, or how often they really happen.
The problem is that, if you don't have ECC to detect the errors, it's very hard to know what exactly caused a random, non-reproducible crash. Especially in kernel mode where there's little memory protection and basically any driver could be writing anywhere at any time.
I can understand Linus's frustration from that point of view: without ECC RAM, when you get some super weird crash report where some pointer got corrupted for no apparent reason, you can't be sure if it was just a random bitflip or if it's actually hiding a bigger problem.
Pretty sure most memory test tools like memtest86 write the memory and then read it back shortly thereafter in relatively small blocks. This makes the window for errors to be introduced dramatically smaller. Most memory in a computer is not being continually rewritten under normal use.
If you manage to replicate bitflips every few days your RAM is broken.
It's the "once every other year" type of bitflip that's the problem. The proverbial "cosmic ray" hitting your DRAM and flipping a bit. That will be caught by ECC but it'll most likely remain a total mystery if it causes your non-ECC hardware to crash.
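For anyone wondering how a flipped bit gets "caught" at all, here is a toy sketch using a Hamming(7,4) code. This is only an illustration: ECC DIMMs typically use a wider code with extra check bits per 64-bit word, and the function names below are made up for this example.

```python
# Toy Hamming(7,4) code: 4 data bits protected by 3 parity bits.
# Positions are numbered 1..7; parity bits sit at positions 1, 2 and 4.
DATA_POSITIONS = [3, 5, 6, 7]


def encode(nibble):
    """Return a codeword as a list indexed 1..7 (index 0 is unused)."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    for p in (1, 2, 4):
        parity = 0
        for pos in range(1, 8):
            # Parity bit p covers every other position whose index has bit p set.
            if pos & p and pos != p:
                parity ^= bits[pos]
        bits[p] = parity
    return bits


def syndrome(bits):
    """XOR of the positions of all set bits; 0 means no error detected."""
    s = 0
    for pos in range(1, 8):
        if bits[pos]:
            s ^= pos
    return s


def decode(bits):
    """Correct at most one flipped bit, then return the 4 data bits."""
    bits = list(bits)  # do not modify the caller's copy
    s = syndrome(bits)
    if s:
        bits[s] ^= 1  # the syndrome is the position of the flipped bit
    return sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))


word = encode(0b1011)
word[6] ^= 1                # simulate a single bit flip at position 6
print(syndrome(word))       # 6
print(bin(decode(word)))    # 0b1011
```

The point is that the hardware not only fixes the flip but also knows it happened and can log it, which is exactly the information you are missing on non-ECC hardware.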
It isn't only cosmic rays. Regular old radiation can also cause it. I've read about a server that had many repeated problems and the techs replaced the entire motherboard at one point.
Then one of them brought in his personal Geiger counter and found the radiation coming off the steel in that rack case was significantly higher than background.
You may never know when the metal you use was recycled from something used to hold radioactive materials.
In the meantime I read here that the rate is around 1 bit flip per GB per month. So in a 120 GB system that should be about 1 flip every 6 hours. But then this number might be wrong.
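The arithmetic behind that estimate, as a quick sketch. It assumes a 30-day month and that flips scale linearly with the amount of RAM; the function name is just for illustration.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def hours_between_flips(ram_gb, flips_per_gb_per_month=1.0):
    """Average hours between bit flips, assuming flips scale linearly with RAM size."""
    flips_per_month = ram_gb * flips_per_gb_per_month
    return HOURS_PER_MONTH / flips_per_month


print(hours_between_flips(120))  # 6.0
```

So the 6 hour figure follows from the quoted rate; whether the rate itself is right is a separate question.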
> A large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance ’09 conference.[6] The actual error rate found was several orders of magnitude higher than the previous small-scale or laboratory studies, with between 25,000 (2.5 × 10^-11 error/bit·h) and 70,000 (7.0 × 10^-11 error/bit·h, or 1 bit error per gigabyte of RAM per 1.8 hours) errors per billion device hours per megabit. More than 8% of DIMM memory modules were affected by errors per year.
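To check how the units in that quote relate to each other, here is a rough conversion sketch. It assumes 1 megabit = 10^6 bits and 1 GB = 8 × 10^9 bits (decimal units), and the function names are made up for this example.

```python
def per_bit_hour(errors_per_billion_hours_per_mbit):
    """Convert 'errors per billion device hours per megabit' to errors per bit-hour."""
    return errors_per_billion_hours_per_mbit / 1e9 / 1e6


def hours_per_error_per_gb(rate_per_bit_hour, bits_per_gb=8e9):
    """Average hours between errors in one gigabyte at the given per-bit rate."""
    return 1 / (rate_per_bit_hour * bits_per_gb)


low = per_bit_hour(25_000)   # about 2.5e-11 errors per bit-hour
high = per_bit_hour(70_000)  # about 7.0e-11 errors per bit-hour

print(round(hours_per_error_per_gb(low), 2))   # 5.0
print(round(hours_per_error_per_gb(high), 2))  # 1.79
```

With these assumptions the upper figure comes out at roughly one error per gigabyte every 1.8 hours, matching the quote, and the lower one at roughly one every 5 hours. Keep in mind these are averages over the whole fleet, not a prediction for any single DIMM.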
Right. My point is that TFA serves zero purpose to most people on here. Those that know how ECC works already know that it is a must-have. Those that don't will learn very little from the post, because it fails to explain what ECC is and why you need it aside from general statements about memory errors. It will reaffirm for those that know what ECC RAM is that it's a good idea, but they already know it anyways. It reads a lot like an article about why vitamin C is a good thing.