by voidmain
3131 days ago
On the contrary, the distribution of incorrect values alone is fairly convincing. A "typical threading bug" does not actually flip a randomly selected bit. And the rate of memory errors on consumer machines in the wild is probably quite high, because machines with hard memory errors are not consistently removed from service. "Random" soft memory errors from alpha particles or whatever aren't the dominant source.

I've seen stranger things than this, including a cluster of servers in which 3 of the 4 machines had frequent bit flips, and when they were all replaced in response, 2 of the 4 new machines also had frequent bit flips. We diagnosed it based on this same kind of distributional evidence: in our case, the bit flips occurred in certain address hyperplanes, consistent on each machine, like bit 5 flipping at an address of the form 0x???6?3B0 every time. The customer was, as you can imagine, pretty skeptical by that point that this was not a software bug. But all of the machines, when booted into memtest86 or whatever it's called, quickly showed errors with the predicted physical address pattern! Dropbox doesn't have the luxury of tracking down customer machines and testing them.

See also "bitsquatting": https://nakedsecurity.sophos.com/2011/08/10/bh-2011-bit-squa...

If it doesn't have ECC memory, it's an approximate computing device.
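For anyone wondering what "distributional evidence" looks like in practice, here is a minimal sketch, not the tooling used in the incident above. It assumes you have (address, expected, observed) records for corrupted values; the record format, the sample data, and the helper names `flipped_bit`, `matches_pattern`, and `summarize` are all made up for illustration. The mask/value pair encodes the 0x???6?3B0 pattern from the example: the `?` nibbles are ignored and the fixed nibbles must match.

```python
from collections import Counter

def flipped_bit(expected, observed):
    """Return the index of the single differing bit, or None if zero or several bits differ."""
    diff = expected ^ observed
    if diff == 0 or diff & (diff - 1):  # no difference, or more than one bit set
        return None
    return diff.bit_length() - 1

def matches_pattern(address, mask, value):
    """True if the address agrees with `value` on every bit selected by `mask`."""
    return address & mask == value

def summarize(records, mask, value):
    """Tally single-bit flips by bit position and count how many sit on the address pattern."""
    bit_counts = Counter()
    pattern_hits = 0
    for address, expected, observed in records:
        bit = flipped_bit(expected, observed)
        if bit is None:
            continue  # not a single-bit error; looks more like a software bug
        bit_counts[bit] += 1
        if matches_pattern(address, mask, value):
            pattern_hits += 1
    return bit_counts, pattern_hits

# 0x???6?3B0: wildcard nibbles are masked out, fixed nibbles are 6 and 3B0.
MASK = 0x000F0FFF
VALUE = 0x000603B0

# Hypothetical sample data, invented for this sketch.
records = [
    (0x12A643B0, 0x00, 0x20),  # bit 5 set
    (0x7FF6C3B0, 0xFF, 0xDF),  # bit 5 cleared
    (0x003613B0, 0x41, 0x61),  # bit 5 set
    (0x12345678, 0x10, 0x13),  # two bits differ, so it is skipped
]

bit_counts, pattern_hits = summarize(records, MASK, VALUE)
print(bit_counts, pattern_hits)
```

If nearly all errors are single-bit flips, concentrated on one bit position and one address pattern per machine, that points at hardware rather than at a race condition, which would tend to produce stale or torn values instead.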