Hacker News
by eridius 3997 days ago
How often does this sort of thing actually happen in real life? Or rather, what's the chance that some given computer will experience one of these events in its operational lifetime (or, if the chance is actually high enough, how many such events would it be expected to see on average given a lifespan of several years)?
7 comments

Somewhere in the range that your laptop will almost certainly never see even a single event, but a very large datacenter or colo will have multiple events a month.

There is a lot of disagreement on bit flips from ionizing radiation. They are unequivocally real, and unequivocally very rare. Even when they do happen, a large portion of the chip is dark a lot of the time, and a lot of the live data in the chip is simply thrown away and never used (think prefetching). Some bits, if flipped, will break something but will not corrupt the disk, and the machine will be able to recover.

Nobody really knows for certain exactly how big of a problem they are and how often they happen; it's all statistics, and it depends on things like where on the globe your computer is, what your building is made of, and what phase of the solar cycle we are in. It even depends on workload. Anybody who claims to know for certain...

Try multiple times a second. This guy made a hobby of taking advantage of the DNS lookup errors that cosmic radiation bit flips cause, and capturing the resulting data.

http://dinaburg.org/bitsquatting.html
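
For anyone wondering what "one bit flip away" looks like for a domain name, here is a minimal sketch in Python. It is not taken from the linked page; the function name `bitsquat_variants` and the choice to leave the dots alone are my own assumptions. It flips each bit of each character and keeps the results that are still valid hostname characters.

```python
import string

# Characters allowed in a hostname label (DNS names are case-insensitive).
VALID = set(string.ascii_lowercase + string.digits + "-")


def bitsquat_variants(domain: str) -> list[str]:
    """Return names that differ from `domain` by exactly one flipped bit
    and are still syntactically valid hostnames.

    Assumption for this sketch: dots are never flipped, so the label
    structure of the name stays the same.
    """
    domain = domain.lower()
    variants = set()
    for i, ch in enumerate(domain):
        if ch == ".":
            continue
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            # Skip invalid characters and pure case changes (same name in DNS).
            if flipped not in VALID or flipped == ch:
                continue
            candidate = domain[:i] + flipped + domain[i + 1:]
            labels = candidate.split(".")
            # A label may not start or end with a hyphen.
            if any(l.startswith("-") or l.endswith("-") for l in labels):
                continue
            variants.add(candidate)
    return sorted(variants)


if __name__ == "__main__":
    for name in bitsquat_variants("cnn.com"):
        print(name)
```

Each name in the output is something a machine could end up resolving if a single bit of the stored hostname flipped at the wrong moment.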

Multiple times a second, if your pool of hardware is "all the internet connected hardware in the world"! Neat experiment.

Also, FWIW, that experiment will include people subject to bit errors in DRAM, not just in the CPU, and I would even guess that bit errors are more common in DRAM than SRAM given their electrical characteristics (a tiny floating capacitor vs. two inverters driving each other).

Bit flips aren't solely caused by radiation; they could also be caused by clock skew or a failing crystal oscillator on an overheating router or something...

When I talked to somebody at Blue Waters (petascale supercomputer at UIUC), she told me that they had uncorrectable errors once or twice per day, even with ECC. Blue Waters has 22640 compute nodes that contain 16 cores each (we'll ignore the GPU nodes). So even if your typical home computer had 16 cores, and taking the worse figure of one error every 12 hours across the whole machine, you would expect 12 hours * 22640 = about 31 years between uncorrectable errors.

Caveats: most computers don't have ECC, and I don't remember if Blue Waters was completely installed when I visited.
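
To make the arithmetic above explicit, here is a rough sketch in Python. It assumes errors are spread evenly and independently across identical nodes, and it uses a Poisson model to turn the per-node interval into a chance of seeing at least one error over a machine's lifetime. Both are my assumptions, not something the Blue Waters staff said, and the 5-year lifetime is just an illustrative number.

```python
import math

HOURS_PER_YEAR = 24 * 365.25


def years_between_errors(fleet_interval_hours: float, nodes: int) -> float:
    """Mean time between errors for a single node, given the interval
    observed across the whole fleet (assumes identical, independent nodes)."""
    return fleet_interval_hours * nodes / HOURS_PER_YEAR


def prob_at_least_one(lifetime_years: float, mean_interval_years: float) -> float:
    """Chance of one or more errors in `lifetime_years`, assuming errors
    arrive as a Poisson process with the given mean interval."""
    return 1.0 - math.exp(-lifetime_years / mean_interval_years)


if __name__ == "__main__":
    per_node = years_between_errors(12, 22640)  # figures from the comment above
    print(f"per-node interval: {per_node:.1f} years")
    print(f"chance of >=1 error in 5 years: {prob_at_least_one(5, per_node):.1%}")
```

Under those assumptions, a single ECC-protected node would still have a noticeable (though small) chance of hitting an uncorrectable error within a few years of use.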

Heh. When I went to college one of my professors told us cosmic rays made computer systems with more than four megabytes completely impractical.

Oh, and get off my lawn.

> Oh, and get off my lawn.

Indeed, that had to be, what, in the early to late 70s? I remember the Cray X-MP in the early 80s supported up to 16MB and frequently came with 4 to start.

On CCD/CMOS sensors it can happen enough to be visible as dead columns or pixels. Pixels can be calibrated out.

Most notably, gamma rays were the suspected cause of dead columns while filming Superman a few years back. As a result, cameras were shipped by sea rather than by air to minimize the chance of this happening (a bit of a reactionary response to this particular incident, and not generally what happens).

Forgot to mention... in talking with NASA about deploying cameras on the space station, the issue of being able to fix pixel death by gamma ray is more relevant. Water is apparently one defense against it, but not so practical. How much water would you need?

Water has a halving thickness of ~18cm for gamma rays. So a fair bit.

Unfortunately, most other materials also have a halving thickness of the same order of magnitude, ~20g/cm^2, when measured as mass per unit area. With gamma rays the single most important thing is just "how much mass is in the way". Which is exactly what you don't want.
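
To put a number on "a fair bit", here is a small sketch in Python using the ~18cm halving thickness quoted above. It assumes simple exponential attenuation with a single halving thickness, which ignores scattering buildup and the fact that the real figure depends on the gamma ray energy. The function names are mine.

```python
import math

HALF_THICKNESS_CM = 18.0  # figure quoted in the thread, treated as a constant


def transmitted_fraction(thickness_cm: float, half_cm: float = HALF_THICKNESS_CM) -> float:
    """Fraction of gamma rays that get through a slab of the given thickness,
    assuming each halving thickness cuts the flux in half."""
    return 0.5 ** (thickness_cm / half_cm)


def thickness_for_reduction(factor: float, half_cm: float = HALF_THICKNESS_CM) -> float:
    """Thickness needed to cut the flux by `factor` (e.g. 1000 for 1000x)."""
    return half_cm * math.log2(factor)


if __name__ == "__main__":
    for factor in (2, 10, 100, 1000):
        print(f"{factor:>5}x reduction: {thickness_for_reduction(factor):6.1f} cm of water")
```

Each extra factor of two costs another 18cm, so a 1000x reduction works out to roughly 1.8m of water under this simplified model.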

This is a good paper on this topic: http://dl.acm.org/citation.cfm?id=1555372.

Another slightly older paper with similar data: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115...

It happens often enough that NASA has asked FPGA synthesis tool vendors to implement an error correction feature for this. Basically, when a state machine goes into some error condition/state, the system is reset to a known state.
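
As a rough illustration of the "reset to a known state" idea, here is a software model in Python. It is not how any particular synthesis tool implements it, and the states, the one-hot encoding, and the `step` function are all made up for the example. With a one-hot encoding, any single bit flip produces a value with zero or two bits set, so it can never be mistaken for a legal state.

```python
from enum import IntEnum


class State(IntEnum):
    # One-hot encoding: exactly one bit set per legal state (hypothetical states).
    IDLE = 0b001
    RUN = 0b010
    DONE = 0b100


LEGAL = {s.value for s in State}

# Normal transitions of this toy machine.
TRANSITIONS = {
    State.IDLE: State.RUN,
    State.RUN: State.DONE,
    State.DONE: State.IDLE,
}

SAFE_STATE = State.IDLE


def step(raw_state: int) -> State:
    """Advance one cycle. If the state register holds an illegal encoding
    (for example after a bit flip), fall back to the known safe state."""
    if raw_state not in LEGAL:
        return SAFE_STATE
    return TRANSITIONS[State(raw_state)]


if __name__ == "__main__":
    print(step(State.IDLE).name)            # normal transition
    print(step(State.IDLE ^ 0b010).name)    # corrupted register, recovers
```

The design choice here is to detect the illegal encoding and recover, rather than to correct the flipped bit, which matches the description above.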

NASA also needs hardware to operate under extreme conditions. For example, anything that goes into orbit is going to have a significantly increased likelihood of being hit by high-energy radiation (because it's no longer protected by the atmosphere). I'd also imagine that they have a much lower tolerance for faults than your average consumer machine (because they're doing things that are much more critically important).

The figure I've seen quoted in the context of embedded systems is 1 bit flip a month.