by jmharvey 3128 days ago
Let's run through the likelihood of seeing multiple flips on the same piece of data.

Google's research [1] finds a DRAM error rate of "25,000 to 70,000 errors per billion device hours per Mbit" on hardware in "modern compute clusters." If there are 100 million Dropbox clients out there, Dropbox clients collectively should encounter 2,500 to 7,000 errors per Mbit per hour, though factoring in the "low-end or old hardware" that many Dropbox clients are running on, the error rate is probably somewhat higher. For the sake of making the math simple, call it 10k errors per Mbit per hour, or 1 error per 100 bit-hours. So a given bit should flip on some user's machine about once every 100 hours, which is on the order of once a week. That seems pretty firmly in the range of "sometimes we see these weird errors that we don't really understand," especially if you multiply by the same error potentially coming from different parts of the program (so "the same piece of data" is really "a few pieces of data that get collapsed together for purposes of analysis").
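For anyone who wants to check the arithmetic, here is a rough sketch in Python. The function names are made up for illustration, and it assumes 1 Mbit = 10^6 bits and that errors are independent and spread evenly across bits and machines, which real hardware probably doesn't honor.

```python
HOURS_PER_WEEK = 24 * 7


def fleet_errors_per_mbit_hour(errors_per_billion_device_hours_per_mbit, clients):
    """Expected errors per Mbit per wall-clock hour, summed over all clients."""
    # Each client contributes one device-hour per wall-clock hour.
    return errors_per_billion_device_hours_per_mbit * clients / 1e9


def hours_between_flips_of_one_bit(fleet_rate_per_mbit_hour, bits_per_mbit=1_000_000):
    """Mean hours between flips of one specific bit somewhere in the fleet.

    Assumes 1 Mbit = 10^6 bits and a uniform error rate across bits.
    """
    return bits_per_mbit / fleet_rate_per_mbit_hour


clients = 100_000_000  # the 100 million clients assumed in the comment
low = fleet_errors_per_mbit_hour(25_000, clients)
high = fleet_errors_per_mbit_hour(70_000, clients)

rounded_rate = 10_000  # the comment's rounded-up figure, errors per Mbit per hour
hours = hours_between_flips_of_one_bit(rounded_rate)
weeks = hours / HOURS_PER_WEEK

print(f"fleet rate: {low:.0f} to {high:.0f} errors per Mbit per hour")
print(f"one given bit flips about every {hours:.0f} hours ({weeks:.2f} weeks)")
```

The interval scales inversely with the assumed rate, so a rate 10x lower would stretch 100 hours to 1,000 hours, or about 6 weeks.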

Your intuition that a "typical threading bug" is much more likely than a random bit flip is spot on, but that actually works in favor of the "random bit flip" thesis. On the Dropbox scale, a threading bug/race condition would typically show up as a significant, persistent issue, several orders of magnitude more common than the random oddball errors described in the article.

[1]

2 comments

> On the Dropbox scale, a threading bug/race condition would typically show up as a significant, persistent issue, several orders of magnitude more common than the random oddball errors described in the article.

Threading bugs could have any kind of frequency, though. The ones that are infrequent are the ones that tend to make it into production...

Citation is missing. Thanks for doing the math, very interesting, but I'm still very skeptical! The paper says they observed bit error rates "orders of magnitude higher than previously reported." So should we expect 1 error per week, or 1 error per 10 or 100 weeks?