by Travis 5250 days ago
Gotcha, so it would just depend on what bits in RAM were corrupted (data versus process instructions)?

I am curious about a couple of specific use cases that I can think of where this might affect me (and, most likely, would be common points of failure for others):

1. Data in the MySQL tables (stored in memory) is corrupted. Would MySQL crash? Would it indicate table corruption, so that I could just reload the table from disk? Or would it write the corrupted data to disk and permanently trash my data collection?

2. A large process is running data analysis (say a big Python process with tons of data in RAM). Would one of my variables (say an int with value 4) turn into another number? Would it become unreadable?

I appreciate the effort to explain this. I know, in theory, why ECC RAM is useful, but I have difficulty visualizing real world scenarios.

2 comments

If it was a C int then it would just change to another value. If it was a Python int then two things can happen: either the bit flip was in the value, which causes the value to change, or the bit flip was in the object's metadata (in CPython that is the type pointer in the object header rather than tag bits), which causes Python to interpret the data as something other than an int. The latter would most likely cause your program to crash.
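To make the C int case concrete, here is a minimal sketch that packs the value 4 as a 32-bit little-endian signed integer and flips a single bit. The helper name `flip_bit_int32` is made up for illustration, and the 32-bit little-endian layout is an assumption about the platform.

```python
import struct

def flip_bit_int32(value: int, bit: int) -> int:
    """Return `value` (a signed 32-bit int) with one bit flipped."""
    raw = bytearray(struct.pack("<i", value))   # 4 bytes, little-endian
    raw[bit // 8] ^= 1 << (bit % 8)             # flip exactly one bit
    return struct.unpack("<i", bytes(raw))[0]

print(flip_bit_int32(4, 0))   # lowest bit: 4 becomes 5
print(flip_bit_int32(4, 2))   # the only set bit: 4 becomes 0
print(flip_bit_int32(4, 31))  # sign bit: 4 becomes -2147483644
```

Which bit gets hit decides whether the result is slightly off or wildly wrong, but in every case it is still a perfectly valid int as far as the program can tell.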

With MySQL, any of those things can happen. If you're lucky then only the cache is corrupted and you can just reload from disk. If you're unlucky then the data got corrupted on its way to disk and the wrong data will be written to disk. If you are astronomically unlucky then the in-memory machine code of MySQL got changed in such a way that it starts overwriting your entire disk with garbage. You should probably be more afraid of meteorites though. And of bugs in either your own or others' code.

ECC RAM reduces the probability of such bit flips. That doesn't mean they are eliminated entirely. So you have to do these two things in any case:

1. Bit flips can cause processes to misbehave/crash. So you want to have a way to detect and restart misbehaving/crashed processes.

2. Even with ECC RAM you want to do your own error correction for critical data (say a bank transaction log).
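As a rough illustration of point 2, the sketch below stores a CRC32 next to each log record and refuses to use a record that no longer matches. The record layout and the names `encode_record` and `decode_record` are assumptions made up for this example. Note that a checksum only detects corruption; actually correcting it would need redundancy such as a second copy or an error-correcting code.

```python
import struct
import zlib

def encode_record(payload: bytes) -> bytes:
    """Prefix the payload with its length and CRC32 (both 32-bit, big-endian)."""
    return struct.pack(">II", len(payload), zlib.crc32(payload)) + payload

def decode_record(record: bytes) -> bytes:
    """Return the payload, or raise ValueError if it fails the checksum."""
    length, expected = struct.unpack(">II", record[:8])
    payload = record[8:8 + length]
    if len(payload) != length or zlib.crc32(payload) != expected:
        raise ValueError("record failed checksum; refusing to use it")
    return payload

record = encode_record(b"transfer 100 from A to B")
assert decode_record(record) == b"transfer 100 from A to B"

damaged = bytearray(record)
damaged[10] ^= 0x01  # simulate a single bit flip inside the payload
try:
    decode_record(bytes(damaged))
except ValueError as err:
    print("detected:", err)
```

This only helps for flips that happen after the checksum was computed, so the checksum should be taken as early as possible, ideally where the data is first produced.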

Here is an interesting paper that discusses the prevalence of DRAM errors and the effectiveness of ECC RAM:

DRAM Errors in the Wild: A Large-Scale Field Study -- http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

It would be interesting if somebody did an experiment where they artificially flipped bits of various software's memory to see what happens. I'd expect that in many cases it doesn't do any harm at all.
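A toy version of that experiment can be sketched in a few lines. The code below flips every bit of a small JSON document in turn and records whether parsing fails, succeeds with different data, or succeeds with identical data. The document contents and the name `classify_flip` are invented for illustration, and this is far from a real fault-injection study: every byte here is live data, whereas in a real process much of memory is unused or never read again.

```python
import json
from collections import Counter

def classify_flip(original: bytes, bit_index: int) -> str:
    """Flip one bit of a JSON document and report what a reader would see."""
    damaged = bytearray(original)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    try:
        decoded = json.loads(bytes(damaged).decode("utf-8"))
    except ValueError:
        # Covers both invalid UTF-8 and invalid JSON: a loud failure.
        return "error"
    if decoded == json.loads(original.decode("utf-8")):
        return "unchanged"
    return "silently changed"

document = json.dumps({"user": "Travis", "balance": 4, "active": True}).encode("utf-8")
outcomes = Counter(classify_flip(document, i) for i in range(len(document) * 8))
for outcome, count in sorted(outcomes.items()):
    print(f"{outcome}: {count}")
```

The "silently changed" bucket is the worrying one, since nothing crashes and nothing complains.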

I suggest looking into studies of radiation effects upon computer systems. They do a lot of bit-flipping. I was privy to results from a confidential study once, and as one might expect, enough bit flips cause big problems (the study went into more details than that, of course).

The answer to most of your questions is 'maybe', unfortunately. Reasoning about the things that could happen when memory errors occur is very difficult, because they occur outside the mental model of computation that most programmers (and systems administrators) use.

Let's use MySQL as an example. A bit flip in the memory which holds the code may cause it to crash. A bit flip in the 'metadata' could cause the table to become corrupted, possibly in a recoverable way. A bit flip in the data itself could turn 'Travis' into 'Trcvis', which might go undetected depending on where it happened and which storage engine you are using.
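For a feel of what corrupted data can look like, this small sketch lists every string that is exactly one bit flip away from 'Travis' in its third byte (the 'a'). The function name `single_bit_variants` is hypothetical, and ASCII storage is assumed.

```python
from typing import List

def single_bit_variants(data: bytes, byte_index: int) -> List[bytes]:
    """Return the 8 byte strings that differ from `data` by one bit in one byte."""
    variants = []
    for bit in range(8):
        damaged = bytearray(data)
        damaged[byte_index] ^= 1 << bit
        variants.append(bytes(damaged))
    return variants

for variant in single_bit_variants(b"Travis", 2):
    # errors="replace" keeps the loop going when a flip yields a non-ASCII byte
    print(variant.decode("ascii", errors="replace"))
```

Several of the results, such as 'Trevis' or 'Trivis', look like plausible names, which is exactly why this kind of corruption can slip through unnoticed.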

The use of memory for OS page caching (less of a factor for databases, which often use O_DIRECT, and more of one for other programs) means that arbitrary corruption could happen to pieces of disk data your program didn't even touch, if you touch data near them.