Hacker News
by joelaaronseely 3997 days ago
There is another mechanism called "Single Event Upset" (SEU) or "Single Event Effects" (SEE) (basically synonymous). This is due to cosmic rays. On the surface of the earth, the effect is mostly abated by the atmosphere - except for neutrons. As you go higher in the atmosphere (say on a mountaintop, or an airplane, or go into space) it becomes worse because of other charged particles that are no longer attenuated by the atmosphere.

The typical issue at sea level is from neutrons hitting silicon atoms. If a neutron hits a nucleus in some area of the microprocessor circuitry, the nucleus suddenly recoils, leaving an ionizing trail several microns long. Given that transistors are now measured in tens of nanometers, the ionizing path can cross many nodes in the circuit and create some sort of state change. Best case, it happens in a single bit of a memory that has error correction and you never notice it. Worst case, it causes latchup (a power-to-ground short) in your processor and your CPU overheats and fries. Generally you would just notice it as a sudden error that causes the system to lock up; you'd reboot and it would come back up and be fine, leaving you with a vague thought of, "That was weird".
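
To make the "memory that has error correction" case concrete, here is a minimal sketch of single-bit correction using a textbook Hamming(7,4) code. This is an illustration only: real ECC memory typically uses wider codes over 64-bit words, and the function names here are made up for the example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit.

    Returns (data_bits, error_position), where error_position is the
    1-based position that was repaired, or 0 if no error was detected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3  # syndrome names the bad position
    if error_position:
        c[error_position - 1] ^= 1  # flip the upset bit back
    return [c[2], c[4], c[5], c[6]], error_position


# Simulate a single event upset in one stored bit.
stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # the "cosmic ray" flips position 5
recovered, repaired_at = hamming74_decode(stored)
```

A plain Hamming(7,4) code like this one miscorrects if two bits flip in the same word, which is why a single particle strike crossing several nodes is harder to deal with than an isolated flip.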

5 comments

How often does this sort of thing actually happen in real life? Or rather, what's the chance that some given computer will experience one of these events in its operational lifetime (or, if the chance is actually high enough, how many such events would it be expected to see on average given a lifespan of several years)?
Somewhere in the range that your laptop will almost certainly never see even a single event, but a very large datacenter or colo will have multiple events a month.

There is a lot of disagreement on bitflips from ionizing radiation. They are unequivocally real, and unequivocally very rare. Even when they do happen, a large portion of the chip is dark a lot of the time, and a lot of the live data in the chip is simply thrown away and never used. (Think prefetching) Some bits, if flipped, will break something but will not corrupt the disk and the machine will be able to recover.

Nobody really knows for certain exactly how big of a problem they are and how often they happen- it's all statistics, and it depends on things like where on the globe your computer is, what your building is made of, and what phase of the solar cycle we are in. It even depends on workload. Anybody who claims to know for certain...

Try multiple times a second. This guy made a hobby of taking advantage of cosmic radiation bit flips to cause DNS lookup problems and capturing data.

http://dinaburg.org/bitsquatting.html
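
For anyone wondering what a bit-flipped DNS name looks like, here is a rough sketch that lists the names one flipped bit away from a given label. It is my own illustration of the general idea, not code from the linked page; the function name and the simplified hostname rules (lowercase letters, digits, hyphen, no leading or trailing hyphen) are assumptions.

```python
import string

# Simplified set of characters allowed in a hostname label (an assumption
# for this sketch; DNS names are compared case-insensitively).
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def bitflip_variants(label):
    """Return labels that differ from `label` by exactly one flipped bit
    and still look like a valid hostname label."""
    label = label.lower()
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            if flipped == ch or flipped not in ALLOWED:
                continue  # case-only change, or not a hostname character
            candidate = label[:i] + flipped + label[i + 1:]
            if candidate.startswith("-") or candidate.endswith("-"):
                continue
            variants.add(candidate)
    return variants


names = sorted(bitflip_variants("example"))
```

Each name in the result is what a resolver would look up if a single bit of the original label were corrupted somewhere between memory and the wire.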

Multiple times a second, if your pool of hardware is "all the internet connected hardware in the world"! Neat experiment.

Also, FWIW, that experiment will include people subject to bit errors in DRAM, not just in the CPU. I would even guess that bit errors are more common in DRAM than in SRAM given their electrical characteristics (a tiny floating capacitor vs. two inverters driving each other).

Bit flips aren't solely caused by radiation; they can also be caused by clock skew or a failing crystal oscillator on an overheating router or something...
When I talked to somebody at Blue Waters (petascale supercomputer at UIUC), she told me that they had uncorrectable errors once or twice per day, even with ECC. Blue Waters has 22640 compute nodes that contain 16 cores each (we'll ignore the GPU nodes). So even if your typical home computer had 16 cores, you would expect 12 hours * 22640 = 31 years between uncorrectable errors.

Caveats: most computers don't have ECC, and I don't remember if Blue Waters was completely installed when I visited.
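
The back-of-the-envelope scaling above, written out. It assumes (my assumption, not something anyone at Blue Waters said) that uncorrectable errors are spread evenly and independently across nodes; the function name is just for illustration.

```python
HOURS_PER_YEAR = 24 * 365.25


def years_between_errors_per_node(nodes, system_errors_per_day):
    """Mean time between uncorrectable errors for a single node, in years,
    given the error rate observed across the whole machine."""
    system_hours_between_errors = 24 / system_errors_per_day
    node_hours_between_errors = system_hours_between_errors * nodes
    return node_hours_between_errors / HOURS_PER_YEAR


# "Once or twice per day" across 22640 compute nodes.
for rate in (1, 2):
    years = years_between_errors_per_node(22640, rate)
    print(f"{rate} error(s)/day system-wide: about {years:.0f} years per node")
```

At two errors per day this gives the 31 years quoted above; at one per day it is roughly double that.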

Heh. When I went to college one of my professors told us cosmic rays made computer systems with more than four megabytes completely impractical.

Oh, and get off my lawn.

> Oh, and get off my lawn.

Indeed, that had to be, what, in the early to late 70s? I remember the Cray X-MP in the early 80s supported up to 16MB and frequently came with 4 to start.

On CCD/CMOS sensors it can happen enough to be visible as dead columns or pixels. Pixels can be calibrated out.

Most notably, gamma rays were the suspected cause of dead columns while filming Superman a few years back. As a result, cameras were shipped by sea rather than by air to minimize the risk (a somewhat reactionary response to this particular incident, and not what generally happens).

Forgot to mention... in talking with NASA about deploying cameras on the space station, the issue of being able to fix pixel death caused by gamma rays is more relevant. Water is apparently one defense against it, but not a very practical one. How much water would you need?
Water has a halving thickness of ~18cm for gamma rays. So a fair bit.

Unfortunately, most other materials have a halving thickness of the same order of magnitude, roughly 20g/cm^2 of areal density. With gamma rays the single most important thing is just "how much mass is in the way". Which is exactly what you don't want.
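
To put numbers on "a fair bit", here is a small sketch using the ~18cm halving thickness quoted above. It treats attenuation as a simple exponential and ignores scattered radiation building up behind the shield; the real halving thickness also depends on the gamma energy, so take the output as order-of-magnitude only. The function names are illustrative.

```python
import math

HALVING_THICKNESS_CM = 18.0  # figure quoted in this thread for water


def transmitted_fraction(thickness_cm, halving_cm=HALVING_THICKNESS_CM):
    """Fraction of gamma intensity left after a given thickness of water."""
    return 0.5 ** (thickness_cm / halving_cm)


def thickness_for_reduction(factor, halving_cm=HALVING_THICKNESS_CM):
    """Water thickness (cm) needed to cut the intensity by `factor`."""
    return halving_cm * math.log2(factor)


for factor in (2, 10, 100, 1000):
    depth = thickness_for_reduction(factor)
    print(f"reduce by {factor}x: about {depth:.0f} cm of water")
```

Every extra factor of two costs another 18cm, so a thousandfold reduction needs close to two meters of water on every side.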

This is a good paper on this topic: http://dl.acm.org/citation.cfm?id=1555372.

Another slightly older paper with similar data: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115...

It happens often enough that NASA has asked FPGA synthesis tool vendors to implement an error correction feature for this. Basically, when a state machine goes into some error condition/state, the system is reset to a known state.
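
A software model of that idea, as a sketch only: the real feature lives in the synthesized hardware, and the state names, encoding, and `step` function below are invented for illustration. With a one-hot encoding, any single flipped bit produces an encoding with zero or two bits set, so it can always be detected and sent back to a known state.

```python
# Hypothetical three-state machine with one-hot state encoding.
IDLE, RUN, DONE = 0b001, 0b010, 0b100
VALID_STATES = {IDLE, RUN, DONE}
NEXT_STATE = {IDLE: RUN, RUN: DONE, DONE: IDLE}
SAFE_STATE = IDLE  # the known state to recover into


def step(state_register):
    """Advance one clock tick. An illegal encoding (for example after a
    bit flip) is not followed; the machine returns to SAFE_STATE."""
    if state_register not in VALID_STATES:
        return SAFE_STATE
    return NEXT_STATE[state_register]


state = RUN
state ^= 0b001          # an upset sets a second bit: 0b011 is illegal
state = step(state)     # recovery path sends the machine to SAFE_STATE
```

Without the validity check, an illegal encoding would have no defined transition and the machine could hang until the next power cycle.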
NASA also needs hardware to operate under extreme conditions. For example, anything that goes into orbit is going to have a significantly increased likelihood of a gamma ray burst (because it's no longer protected by the atmosphere). I'd also imagine that they have a much lower tolerance for faults than your average consumer machine as well (because they're doing things that are much more critically important).
The figure I've seen quoted in the context of embedded systems is 1 bit flip a month.
> This is due to cosmic rays. On the surface of the earth, the effect is mostly abated by the atmosphere - except for neutrons.

So, if we had more hydrogen (either free or compounds) in the air this would not be the case, right?

The column of air on top of your head is equivalent (in terms of mass) to a column of water 10 meters tall, with the same base section area. But the composition is quite different, of course - the only major component they have in common is oxygen.
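
A quick check of the 10 meter figure, using standard sea-level pressure and standard gravity (the variable names are mine):

```python
STANDARD_PRESSURE_PA = 101325.0   # standard atmosphere, N/m^2
STANDARD_GRAVITY = 9.80665        # m/s^2
WATER_DENSITY = 1000.0            # kg/m^3

# Pressure is the weight of the air column per unit area, so dividing by g
# gives the mass of air above each square meter.
air_mass_per_m2 = STANDARD_PRESSURE_PA / STANDARD_GRAVITY   # kg/m^2

# Height of a water column with the same mass over the same base area.
equivalent_water_height_m = air_mass_per_m2 / WATER_DENSITY

print(f"air column: about {air_mass_per_m2:.0f} kg per square meter")
print(f"equivalent water column: about {equivalent_water_height_m:.1f} m")
```

That works out to roughly 1000 g/cm^2 of air overhead at sea level.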

Cosmic rays have energies in the tens to hundreds of MeV range, where pair production is the dominant attenuation mechanism. The probability of a photon inducing pair production in a material is roughly proportional to the square of the proton number, ergo cosmic rays don't give a shit about hydrogen (which has the smallest proton number possible). Even if the atmosphere were pure hydrogen gas at STP, the average distance traveled by a 20 MeV cosmic ray would be around 17km. http://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z01...

Hydrogen is pretty good at moderating neutrons down to thermal energies (eV range, i.e. room temperature) via elastic scattering, but gases don't really have enough density to do a very good job. If you really want to protect something from neutrons you just coat it with boron. A mm coating of the stuff will keep out pretty much any common source of neutrons.

Right, I was thinking about the neutrons. Maybe something like methane would be more efficient than pure hydrogen. Anyway, it's just a thought.

It's very surprising to see how efficient boron is. I thought neutron shields (paraffin, water) are supposed to be very thick. Maybe boron does the job via a different mechanism?

Boron is almost exclusively an absorber, so the boron nucleus captures the neutron and it basically disappears. Hydrogen is primarily a moderator, so it reduces the energy via elastic scattering, but a significant number of very low energy neutrons still escape (some absorption to produce deuterium also occurs). Thermal neutrons can still cause damage, but they're easy to block with a subsequent thin layer of lead or something like that.

Boron's probability to capture a neutron is astronomically high, that's why you can get away with so little. Environmental sources of neutrons are actually pretty rare normally and most neutrons you do see will be pretty low energy and won't have a huge amount of penetrative power. A thin layer of boron will pretty much stop them. Pyrex (like the stuff baking dishes are made of, which is borosilicate glass - glass with boron added) is actually commonly used as control material in nuclear reactors.
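
A rough estimate of why a thin layer is enough, for thermal neutrons only. The constants are approximate textbook values that I am supplying as assumptions (they are not from this thread): a thermal capture cross section of about 3840 barns for boron-10, about 20% boron-10 in natural boron, and a density of about 2.34 g/cm^3. Fast neutrons see a far smaller cross section, so this does not apply to them until they have been slowed down.

```python
import math

AVOGADRO = 6.022e23            # atoms per mole
BARN_CM2 = 1e-24               # one barn in cm^2

# Approximate values, assumed for illustration.
B10_THERMAL_CAPTURE_BARNS = 3840.0
B10_NATURAL_FRACTION = 0.199
BORON_DENSITY_G_CM3 = 2.34
BORON_MOLAR_MASS = 10.81       # g/mol


def thermal_neutron_transmission(thickness_cm):
    """Fraction of thermal neutrons passing through natural boron without
    being captured, using simple exponential attenuation."""
    atoms_per_cm3 = BORON_DENSITY_G_CM3 * AVOGADRO / BORON_MOLAR_MASS
    sigma_cm2 = B10_NATURAL_FRACTION * B10_THERMAL_CAPTURE_BARNS * BARN_CM2
    macroscopic_cross_section = atoms_per_cm3 * sigma_cm2   # per cm
    return math.exp(-macroscopic_cross_section * thickness_cm)


fraction_through_1mm = thermal_neutron_transmission(0.1)
print(f"through 1 mm of boron: {fraction_through_1mm:.1e} of thermal neutrons")
```

Under these assumptions the mean free path for capture is on the order of a tenth of a millimeter, which is why a millimeter of boron stops nearly all thermal neutrons.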

US Patent 7309866 - Cosmic ray detectors for integrated circuit chips - Intel (applied for 2004-06, issued 2007-12)

http://www.google.com/patents/US7309866

Is Intel using any such things in their commodity chips?

Next time a surprising bug pops up on my server I'll just blame it on cosmic rays and reboot.
"Sunspots" is a very old sysadmin joke.
I actually have had one user take me seriously with that one. It's been a while, but I believe I also attributed her problem to the phase of the moon.
I tend to deduce and eventually blame either gravity or gremlins. I was never wrong.
As process nodes shrink and this happens more often, perhaps we'll eventually have to move to a more probability-based model of computation, giving up classical predictability.
Well, this is what researchers were predicting 12-14 years ago. There was a lot of work in fault tolerant and probabilistic computer architectures as a result of this. (I believed the prediction and contributed to some of this work.)

The prediction was that chips below 32nm wouldn't work reliably and the only option would be to use these exotic architectures.

Well, here we are at 14nm and everything seems to be going okay.