Hacker News
by jules 5249 days ago
If it was a C int then it would just change to another value. If it was a Python int then one of two things can happen: either the bit flip was in the value, which causes the value to change, or the bit flip was in the tag bits, which causes Python to interpret the data as something other than an int. The latter would most likely cause your program to crash.
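To make the C int case concrete, here is a minimal sketch that simulates a single bit flip in a fixed-width two's-complement integer. The function name `flip_bit` and the default 32-bit width are assumptions for illustration; the real width of a C int depends on the platform.

```python
def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Simulate one bit flip in a signed integer of `width` bits.

    `width=32` is an assumption; a C int is not guaranteed to be 32 bits.
    """
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    # Work on the raw unsigned bit pattern, then flip the chosen bit.
    flipped = (value & mask) ^ (1 << bit)
    # Reinterpret the pattern as a signed two's-complement value.
    if flipped >= 1 << (width - 1):
        flipped -= 1 << width
    return flipped
```

For example, `flip_bit(1000, 0)` gives 1001 and `flip_bit(1000, 31)` gives -2147482648, so how much damage a flip does depends heavily on which bit is hit.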

With MySQL any of those things can happen. If you're lucky then only the cache is corrupted and you can just reload from disk. If you're unlucky then the data got corrupted on its way to disk and the wrong data will be written to disk. If you are astronomically unlucky then the in-memory machine code of MySQL got changed in such a way that it starts overwriting your entire disk with garbage. You should probably be more afraid of meteorites though. And of bugs in either your own or others' code.

ECC RAM reduces the probability of such a bit flip happening. That doesn't mean that bit flips are eliminated entirely. So you have to do these two things in any case:

1. Bit flips can cause processes to misbehave/crash. So you want to have a way to detect and restart misbehaving/crashed processes.

2. Even with ECC RAM you want to do your own error correction for critical data (say a bank transaction log).
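As a rough illustration of point 2, here is a minimal sketch that appends a CRC32 checksum to each record and keeps several copies, so a corrupted copy can be detected and a clean one used instead. The function names `protect`, `verify`, and `recover` are hypothetical, and checksum-plus-replicas is just one simple scheme, not a recommendation for a real transaction log.

```python
import zlib

def protect(record: bytes) -> bytes:
    """Append a 4-byte big-endian CRC32 checksum to the record."""
    return record + zlib.crc32(record).to_bytes(4, "big")

def verify(blob: bytes) -> bytes:
    """Return the record if its checksum matches, else raise ValueError."""
    record, stored = blob[:-4], int.from_bytes(blob[-4:], "big")
    if zlib.crc32(record) != stored:
        raise ValueError("checksum mismatch: record is corrupted")
    return record

def recover(copies: list[bytes]) -> bytes:
    """Return the record from the first copy whose checksum verifies."""
    for blob in copies:
        try:
            return verify(blob)
        except ValueError:
            continue  # this copy is damaged, try the next one
    raise ValueError("all copies are corrupted")
```

CRC32 catches every single-bit error in a record, but it only detects corruption; the repair here comes from having a second, intact copy to fall back on.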

Here is an interesting paper that discusses the prevalence of DRAM errors and the effectiveness of ECC RAM:

DRAM Errors in the Wild: A Large-Scale Field Study -- http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

It would be interesting if somebody did an experiment where they artificially flipped bits of various software's memory to see what happens. I'd expect that in many cases it doesn't do any harm at all.
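A toy version of that experiment can be sketched in a few lines. This does not touch another process's memory; it flips one random bit in a buffer of 32-bit values and checks whether a stand-in "program" (here, taking the maximum) produces a different result. The names `flip_random_bit`, `workload`, and `run_trials`, and the choice of workload, are all assumptions made for illustration.

```python
import random
import struct

def flip_random_bit(buf: bytearray, rng: random.Random) -> None:
    """Flip one uniformly chosen bit of the buffer in place."""
    i = rng.randrange(len(buf) * 8)
    buf[i // 8] ^= 1 << (i % 8)

def workload(buf: bytes) -> int:
    """Hypothetical stand-in program: largest 32-bit unsigned value stored."""
    count = len(buf) // 4
    return max(struct.unpack(f"<{count}I", buf))

def run_trials(values: list[int], trials: int, seed: int = 0) -> dict:
    """Count how often a single bit flip changes the workload's output."""
    rng = random.Random(seed)
    clean = struct.pack(f"<{len(values)}I", *values)
    expected = workload(clean)
    outcomes = {"masked": 0, "changed": 0}
    for _ in range(trials):
        buf = bytearray(clean)  # fresh copy, exactly one flip per trial
        flip_random_bit(buf, rng)
        key = "masked" if workload(buf) == expected else "changed"
        outcomes[key] += 1
    return outcomes
```

Calling `run_trials([1, 2, 3, 1000000], trials=1000)` returns a count of flips that left the output unchanged versus flips that altered it. A real experiment would need to inject faults into a running process and also track crashes and hangs, which this sketch does not attempt.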

1 comment

I suggest looking into studies of radiation effects upon computer systems. They do a lot of bit-flipping. I was privy to results from a confidential study once, and as one might expect, enough bit flips cause big problems (the study went into more details than that, of course).