	by aetherspawn 3127 days ago
	I really doubt they were legitimate bit flips and not just software bugs, to be honest. Seeing multiple flips on the same piece of data sounds like a typical threading bug.

4 comments

jmharvey 3126 days ago

Let's run through the likelihood of seeing multiple flips on the same piece of data.

Google's research [1] finds a DRAM error rate of "25,000 to 70,000 errors per billion device hours per Mbit" on hardware in "modern compute clusters." If there are 100 million Dropbox clients out there, Dropbox clients should encounter 2,500 to 7,000 errors per Mbit per hour, though factoring in the "low-end or old hardware" that many Dropbox clients are running on, the error rate is probably somewhat higher. For the sake of making the math simple, call it 10k errors per Mbit per hour, or 1 error per 100 bit-hours. So a given bit should flip on some user's machine about once every 100 hours, which is on the order of once a week. That seems pretty firmly in the range of "sometimes we see these weird errors that we don't really understand," especially if you multiply by the same error potentially coming from different parts of the program (so "the same piece of data" is really "a few pieces of data that get collapsed together for purposes of analysis").
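For anyone who wants to check the arithmetic, here is a minimal sketch of the estimate above. The 100 million client count and the rounding to 10k errors per Mbit per hour are the comment's own assumptions; treating a Mbit as 10^6 bits and the function name are assumptions made for this sketch.

```python
# Back-of-the-envelope check of the fleet-wide bit-flip estimate.
# All inputs are the rough figures quoted in the comment, not measurements.

PAPER_RATE_RANGE = (25_000, 70_000)  # errors per billion device-hours per Mbit
CLIENTS = 100_000_000                # assumed number of Dropbox clients
BITS_PER_MBIT = 1_000_000            # assumption: decimal megabit


def fleet_errors_per_mbit_hour(rate, clients=CLIENTS):
    """Errors per Mbit per wall-clock hour, summed over the whole fleet."""
    # `clients` devices accumulate `clients` device-hours every hour.
    return rate * clients / 1_000_000_000


low, high = (fleet_errors_per_mbit_hour(r) for r in PAPER_RATE_RANGE)

# Round up to 10k errors per Mbit per hour, as the comment does.
rounded_rate = 10_000
hours_between_flips = BITS_PER_MBIT / rounded_rate  # per given bit, fleet-wide
days_between_flips = hours_between_flips / 24

print(f"{low:.0f} to {high:.0f} errors per Mbit per hour across the fleet")
print(f"one flip of a given bit every {hours_between_flips:.0f} hours "
      f"(about {days_between_flips:.1f} days)")
```

This only reproduces the multiplication; it says nothing about whether the server-cluster rate carries over to consumer hardware.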

Your intuition that a "typical threading bug" is much more likely than a random bit flip is spot on, but that actually works in favor of the "random bit flip" thesis. On the Dropbox scale, a threading bug/race condition would typically show up as a significant, persistent issue, several orders of magnitude more common than the random oddball errors described in the article.

[1]

d--b 3126 days ago

> On the Dropbox scale, a threading bug/race condition would typically show up as a significant, persistent issue, several orders of magnitude more common than the random oddball errors described in the article.

Threading bugs could have any kind of frequency, though. The ones that are infrequent are the ones that tend to make it into production...

aetherspawn 3126 days ago

Citation is missing. Thanks for doing the math, very interesting, but I'm still very skeptical! The paper says they observed bit error rates "orders of magnitude higher than previously reported", so should we expect 1 error per week, or 1 error per 10 or 100 weeks?

jmharvey 3126 days ago

oops: https://static.googleusercontent.com/media/research.google.c...

voidmain 3126 days ago

On the contrary, the distribution of incorrect values alone is fairly convincing. A "typical threading bug" does not actually flip a randomly selected bit. And the rate of memory errors in consumer machines in the wild is probably quite high, because machines with hard memory errors are not consistently removed from service. "Random" soft memory errors from alpha particles or whatever aren't the dominant source.
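To make the "distribution of incorrect values" point concrete: a single bit flip leaves a corrupted value that differs from the expected one in exactly one bit, which is cheap to test with XOR. This is a rough sketch; `flipped_bit` is a hypothetical helper and the comma examples are illustrative, not taken from Dropbox's data.

```python
def flipped_bit(expected: int, observed: int):
    """Return the index of the single differing bit, or None if the two
    values are equal or differ in more than one bit."""
    diff = expected ^ observed
    # A nonzero value with exactly one bit set satisfies diff & (diff - 1) == 0.
    if diff != 0 and diff & (diff - 1) == 0:
        return diff.bit_length() - 1
    return None


comma = ord(",")  # 0x2C
print(flipped_bit(comma, ord("-")))  # 0x2C vs 0x2D differ only in bit 0
print(flipped_bit(comma, ord(".")))  # 0x2C vs 0x2E differ only in bit 1
print(flipped_bit(comma, ord("/")))  # 0x2C vs 0x2F differ in two bits
```

If nearly every bad value in a report is one XOR-with-a-power-of-two away from the expected one, that is a pattern a software bug would have to go out of its way to produce.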

I've seen stranger things than this, including a cluster of servers in which 3 of the 4 machines had frequent bit flips, and when they were all replaced in response, 2 of the 4 new machines also had frequent bit flips. We diagnosed based on this same kind of distributional evidence (in our case, the bit flips occurred in certain address hyperplanes, consistent on each machine, like having bit 5 flip at an address like 0x???6?3B0 every time). The customer was, as you can imagine, pretty skeptical that this was not a software bug at that point. But all of the machines, when booted into memtest86 or whatever it's called, quickly found errors with the predicted physical address pattern! Dropbox doesn't have the luxury of tracking down the customer machines and testing them.
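A sketch of what checking for an "address hyperplane" can look like: turn a pattern such as 0x???6?3B0 into a mask and a value, then test each faulting address against it. The `?` wildcard syntax, the function names, and the sample addresses are assumptions made for illustration, not the tooling we actually used.

```python
def compile_pattern(pattern: str):
    """Turn a hex pattern with '?' wildcard digits into (mask, value)."""
    digits = pattern[2:] if pattern.lower().startswith("0x") else pattern
    mask = 0
    value = 0
    for ch in digits:
        mask <<= 4
        value <<= 4
        if ch != "?":
            mask |= 0xF            # this hex digit must match exactly
            value |= int(ch, 16)
    return mask, value


def matches(address: int, pattern: str) -> bool:
    """True if the address agrees with every fixed digit of the pattern."""
    mask, value = compile_pattern(pattern)
    return (address & mask) == value


PATTERN = "0x???6?3B0"
for addr in (0x12A6F3B0, 0x00060 << 12 | 0x3B0, 0x12A7F3B0):
    print(hex(addr), matches(addr, PATTERN))
```

When the corrupted addresses on a machine all fall in one such pattern, the fixed address bits point at a specific physical row, column, or chip rather than at anything the software controls.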

If it doesn't have ECC memory, it's an approximate computing device.

d--b 3126 days ago

Yes, that seemed strange to me too. First, there is no particular reason why it would affect the first bit of the comma more than the other ones.

Second, these strings are most likely concatenated on the fly right before sending them over to the server. So it wouldn't be a disk bit flip, it'd be in-memory, and for the life span of that particular string.

Third, if these were that frequent, then it means the _rest_ of the string, i.e. the actual hashes, would be wrong too, and that would seriously impair the service, wouldn't it?

iainmerrick 3127 days ago

Yes! And I was surprised that the conclusion to the hunt for a "favorite bug" was just to ignore it. It's definitely not clear that the diagnosis was correct.

haimez 3126 days ago

Don't worry, the analysis is flat wrong. I'm betting there was a C extension in play in their hashing library.
