by cozzyd 162 days ago
Until you have a bit flip or a silicon error. Or someone changed the floating point rounding mode.

Funny you should mention the floating-point rounding mode: I actually had to fix a bug like that once. Our program worked fine until you printed to an HP printer, and then it crashed shortly after. It took forever to discover the cause: the printer driver was changing the floating-point rounding mode and not restoring it. The fix was to set the mode back to a known value every time after printing something.
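
For illustration only, here is roughly what that "reset after every print" pattern looks like. Python has no portable way to touch the hardware rounding mode, so this sketch swaps in the `decimal` module's context rounding as a stand-in for it; `misbehaving_driver_call`, `print_then_reset`, and `round_cents` are all made-up names, not anything from the original program.

```python
import decimal


def misbehaving_driver_call():
    # Hypothetical stand-in for the printer driver: it changes the
    # rounding mode as a side effect and never puts it back.
    decimal.getcontext().rounding = decimal.ROUND_DOWN


def print_then_reset(known_mode=decimal.ROUND_HALF_EVEN):
    # Mirror the fix from the comment: after every print, force the
    # mode back to a known value instead of trusting the callee.
    try:
        misbehaving_driver_call()
    finally:
        decimal.getcontext().rounding = known_mode


def round_cents(value):
    # Rounds using whatever rounding mode the current context holds.
    return decimal.Decimal(value).quantize(decimal.Decimal("0.01"))
```

Calling `misbehaving_driver_call()` directly changes what `round_cents("2.675")` returns afterwards, while going through `print_then_reset()` keeps later results stable. Resetting to a known value rather than "whatever it was before" also covers the case where an earlier call had already left the mode in a bad state.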
That is amazingly devious. Well done HP.
> Until you have a bit flip

These are vanishingly unlikely if you mostly target consumer/server hardware. People who code for environments like satellites or nuclear facilities have to worry about it, sure, but it's not a realistic issue for the rest of us.

Bitflips are waaay more common than you think they are. [0]

> A 2011 Black Hat paper detailed an analysis where eight legitimate domains were targeted with thirty one bitsquat domains. Over the course of about seven months, 52,317 requests were made to the bitsquat domains.

[0] https://en.wikipedia.org/wiki/Bitsquatting
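
To make the idea concrete, here is a minimal sketch of how single-bit-flip neighbours of a domain could be enumerated. The function name `bitsquat_candidates` is made up for illustration, and it simplifies things by only touching the first label and not checking for leading or trailing hyphens.

```python
def bitsquat_candidates(domain):
    """Return domains that differ from `domain` by one flipped bit."""
    # Characters allowed in a hostname label (DNS is case-insensitive,
    # so uppercase results are the same domain and are skipped).
    allowed = set("abcdefghijklmnopqrstuvwxyz0123456789-")
    name, dot, rest = domain.partition(".")
    variants = set()
    for i, ch in enumerate(name):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            # Keep only flips that still yield a legal hostname character.
            if flipped in allowed:
                variants.add(name[:i] + flipped + name[i + 1:] + dot + rest)
    return sorted(variants)


if __name__ == "__main__":
    for candidate in bitsquat_candidates("example.com"):
        print(candidate)
```

Most flips land on characters that are not valid in a hostname, which is why each legitimate domain has only a modest number of usable bitsquat neighbours.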

> Bitflips are waaay more common than you think they are... Over the course of about seven months, 52,317 requests...

Your data does not show them to be common - less than 1 in 100,000 computing devices seeing an issue during a 7 month test qualifies as "rare" in my book (and in fact the vast majority of those events seem to come from a small number of server failures).

And we know from Google's datacenter research[0] that bit flips are highly correlated with hard failures (i.e. they tend to result from a faulty DRAM module, and so affect a small number of machines repeatedly).

It's hard to pin down numbers for soft failures, but it seems to be somewhere in the realm of 100 events/gigabyte/year - and that's before any of the many ECC mechanisms do their thing. In a practical sense, no consumer software worries about bit flips in RAM (whereas bit flips in storage are much more likely, hence checksumming DB rows, etc).

[0]: https://static.googleusercontent.com/media/research.google.c...
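
As a rough sketch of the row-checksumming idea mentioned above (not how any particular database does it), a CRC stored next to the payload is enough to catch a single flipped bit on read. `store_row`, `load_row`, and `flip_bit` are hypothetical helper names.

```python
import zlib


def store_row(payload: bytes):
    # Keep a CRC-32 of the payload alongside the data itself.
    return payload, zlib.crc32(payload)


def load_row(payload: bytes, checksum: int) -> bytes:
    # Recompute on read; a mismatch means the stored bytes changed.
    if zlib.crc32(payload) != checksum:
        raise ValueError("checksum mismatch: row is corrupted")
    return payload


def flip_bit(data: bytes, index: int, bit: int) -> bytes:
    # Simulate storage corruption by flipping one bit of one byte.
    corrupted = bytearray(data)
    corrupted[index] ^= 1 << bit
    return bytes(corrupted)
```

A checksum like this only detects the corruption; recovering the row still needs a replica, a backup, or an error-correcting code.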

1 in 100,000 devices is about 1 in 40,000 customers, given how many devices most people own.

Which means if you're a medium-sized business or larger, one of your customers will see this about once a year.

That classifies more as "inevitable" than "rare" in my book.
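
The back-of-envelope numbers here can be sketched as follows. The figure of 2.5 devices per customer is an assumption (it is what going from 100,000 devices to 40,000 customers implies), as is treating the 7-month rate as uniform over time; the function name is made up.

```python
def affected_customers_per_year(customers, devices_per_customer=2.5,
                                device_rate=1 / 100_000, window_months=7):
    # Expected number of devices hit during the observation window.
    per_window = customers * devices_per_customer * device_rate
    # Scale the window (7 months in the thread) up to a full year.
    return per_window * 12 / window_months


if __name__ == "__main__":
    print(affected_customers_per_year(40_000))
```

Under those assumptions, a customer base of 40,000 works out to between one and two affected customers a year.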

> That classifies more as "inevitable" than "rare" in my book.

But also pretty much insignificant. Is any other component in your product achieving 5 9s reliability?

We're not talking 5 9s here.

> ... A new consumer grade machine with 4GiB of DRAM, will encounter 3 errors a month, even assuming the lowest estimate of 120 FIT per megabit.
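
The arithmetic behind that quote is easy to check. This is a sketch that assumes FIT means failures per 10^9 device-hours, that a megabit is 2^20 bits, and that a month averages 730.5 hours; `expected_errors_per_month` is a made-up name.

```python
HOURS_PER_MONTH = 24 * 365.25 / 12  # average month, 730.5 hours


def expected_errors_per_month(dram_gib, fit_per_mbit):
    # GiB -> megabits (1 GiB = 1024 MiB, 8 bits per byte).
    megabits = dram_gib * 1024 * 8
    # FIT is failures per billion hours, so divide by 1e9 for per-hour.
    failures_per_hour = fit_per_mbit * megabits / 1e9
    return failures_per_hour * HOURS_PER_MONTH


if __name__ == "__main__":
    print(expected_errors_per_month(4, 120))
```

With 4 GiB and 120 FIT per megabit this comes out a little under 3 per month, which matches the quoted figure.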

The guarantee offered by our hardware suppliers today is not "never happens" but "accounted for in software".

So, if you ignore it, and start to operate at any scale, you will start to see random irreproducible faults.

Sure, you can close all tickets as user error or unable to reproduce. But it isn't the user at fault. Account for it, and your software has fewer glitches than the competitor's.

Of course, any attempt at safety or security requires defense in depth.

But usually, any effort spent on making one layer sturdy is worth it.