by aetherspawn 3127 days ago
I really doubt they were legitimate bit flips and not just software bugs, to be honest.

The likelihood that you’ve seen multiple flips on the same piece of data .. sounds like a typical threading bug.

4 comments

Let's run through the likelihood of seeing multiple flips on the same piece of data.

Google's research [1] finds a DRAM error rate of "25,000 to 70,000 errors per billion device hours per Mbit" on hardware in "modern compute clusters." If there are 100 million Dropbox clients out there, Dropbox clients should encounter 2,500 to 7,000 errors per Mbit per hour, though factoring in the "low-end or old hardware" that many Dropbox clients are running on, the error rate is probably somewhat higher. For the sake of making the math simple, call it 10k errors per Mbit per hour, or 1 error per 100 bit-hours. So a given bit should flip on some user's machine about once every 100 hours, which is on the order of once a week. That seems pretty firmly in the range of "sometimes we see these weird errors that we don't really understand," especially if you multiply by the same error potentially coming from different parts of the program (so "the same piece of data" is really "a few pieces of data that get collapsed together for purposes of analysis").
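
For anyone who wants to check the arithmetic, here is a rough sketch of the same estimate in Python. The inputs are only the figures quoted in this comment (the 100 million client count and the rounded 10k rate are assumptions from the comment, not measurements), and the function names are made up for illustration.

```python
# Back-of-the-envelope version of the estimate in this comment.
# Every input is a figure quoted in the comment, not a measurement.

RATE_RANGE = (25_000, 70_000)  # errors per billion device hours per Mbit, as quoted from [1]
CLIENTS = 100_000_000          # assumed number of clients ("100 million")
BITS_PER_MBIT = 1_000_000      # assuming decimal megabits


def fleet_errors_per_mbit_hour(rate_per_billion_device_hours, clients=CLIENTS):
    """Expected errors per Mbit per wall-clock hour, summed over all clients."""
    return rate_per_billion_device_hours * clients / 1_000_000_000


def hours_between_flips_of_one_bit(errors_per_mbit_hour):
    """Mean hours between flips of one specific bit, somewhere in the fleet."""
    return BITS_PER_MBIT / errors_per_mbit_hour


if __name__ == "__main__":
    low, high = (fleet_errors_per_mbit_hour(r) for r in RATE_RANGE)
    print(f"fleet-wide: {low:.0f} to {high:.0f} errors per Mbit per hour")

    rounded_rate = 10_000  # the comment's rounded-up figure for older hardware
    hours = hours_between_flips_of_one_bit(rounded_rate)
    print(f"one given bit: a flip every {hours:.0f} hours (~{hours / 24:.1f} days)")
```

With the rounded rate this comes out to 100 hours, a bit over four days, which is where "on the order of once a week" comes from.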

Your intuition that a "typical threading bug" is much more likely than a random bit flip is spot on, but that actually works in favor of the "random bit flip" thesis. On the Dropbox scale, a threading bug/race condition would typically show up as a significant, persistent issue, several orders of magnitude more common than the random oddball errors described in the article.

[1]

> On the Dropbox scale, a threading bug/race condition would typically show up as a significant, persistent issue, several orders of magnitude more common than the random oddball errors described in the article.

Threading bugs could have any kind of frequency, though. The ones that are infrequent are the ones that tend to make it into production...

Citation is missing. Thanks for doing the math, very interesting, but I'm still very skeptical! The paper says they observed bit error rates "order of magnitudes higher than previously reported" so should we expect 1 error per week or 1 error per 10 or 100 weeks?

On the contrary, the distribution of incorrect values alone is fairly convincing. A "typical threading bug" does not actually flip a randomly selected bit. And the rate of memory errors in consumer machines in the wild is probably quite high, because machines with hard memory errors are not consistently removed from service. "Random" soft memory errors from alpha particles or whatever aren't the dominant source.
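
To make the "distribution of incorrect values" point concrete, here is a minimal sketch of the kind of check involved: XOR the expected value with the observed one and see whether exactly one bit differs. The helper names are hypothetical and the comma example is only illustrative; it does not reproduce the article's actual data.

```python
def flipped_bits(expected, observed):
    """Return the bit positions (0 = least significant) where two values differ."""
    diff = expected ^ observed
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


def is_single_bit_flip(expected, observed):
    """True when the observed value is exactly one bit away from the expected one."""
    return len(flipped_bits(expected, observed)) == 1


if __name__ == "__main__":
    comma = ord(",")  # 0x2C
    for ch in "-l;":
        bits = flipped_bits(comma, ord(ch))
        print(f"',' -> {ch!r}: differing bits {bits}, single flip: {len(bits) == 1}")
```

If corrupted values cluster on results that are one bit away from the expected byte, that looks like a hardware fault. Values that differ in several unrelated bits would point more toward software.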

I've seen stranger things than this, including a cluster of servers in which 3 of the 4 machines had frequent bit flips, and when they were all replaced in response, 2 of the 4 new machines also had frequent bit flips. We diagnosed based on this same kind of distributional evidence (in our case, the bit flips occurred in certain address hyperplanes, consistent on each machine, like having bit 5 flip at an address like 0x???6?3B0 every time). The customer was, as you can imagine, pretty skeptical that this was not a software bug at that point. But all of the machines, when booted into memtest86 or whatever it's called, quickly found errors with the predicted physical address pattern! Dropbox doesn't have the luxury of tracking down the customer machines and testing them.
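
A sketch of what matching an address pattern like that could look like, treating each "?" as a wildcard hex digit. This is a hypothetical illustration, not the tooling actually used in that diagnosis, and the function names are invented.

```python
def compile_pattern(pattern):
    """Turn a hex pattern such as '???6?3B0' into a (mask, value) pair.

    '?' means "any hex digit"; every other character must match exactly.
    """
    mask = 0
    value = 0
    for ch in pattern:
        mask <<= 4
        value <<= 4
        if ch != "?":
            mask |= 0xF
            value |= int(ch, 16)
    return mask, value


def matches(address, pattern):
    """True when the fixed hex digits of the pattern agree with the address."""
    mask, value = compile_pattern(pattern)
    return (address & mask) == value


if __name__ == "__main__":
    pattern = "???6?3B0"
    for addr in (0x1236A3B0, 0x1237A3B0):
        print(f"{addr:#010x} matches {pattern}: {matches(addr, pattern)}")
```

The fixed digits pin down a subset of address bits, which is the "hyperplane" idea: every faulty address agrees on those bits and varies freely on the rest.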

See also "bitsquatting": https://nakedsecurity.sophos.com/2011/08/10/bh-2011-bit-squa...

If it doesn't have ECC memory, it's an approximate computing device.

Yes, that seemed strange to me too. First, there is no particular reason why it would affect the first bit of the comma more than the other ones.

Second, these strings are most likely concatenated on the fly right before sending them over to the server. So it wouldn't be a disk bit flip, it'd be in-memory, and for the life span of that particular string.

Third, if these were that frequent, then it means the _rest_ of the string, i.e. the actual hashes, would be wrong too, and that would seriously impair the service, wouldn't it?

Yes! And I was surprised that the conclusion to the hunt for a "favorite bug" was just to ignore it. It's definitely not clear that the diagnosis was correct.

Don't worry, the analysis is flat wrong. I'm betting there was a C extension in play in their hashing library.