Hacker News
by belter 1898 days ago
They have to do two things :-)

1) Use ECC memory

2) Go underground

"One experiment measured the soft error rate at the sea level to be 5,950 failures in time (FIT = failures per billion hours) per DRAM chip. When the same test setup was moved to an underground vault, shielded by over 50 feet (15 m) of rock that effectively eliminated all cosmic rays, zero soft errors were recorded.[6] In this test, all other causes of soft errors are too small to be measured, compared to the error rate caused by cosmic rays."

"Soft Errors" https://en.wikipedia.org/wiki/Soft_error#Cosmic_rays_creatin...

5 comments

> 1) Use ECC memory

Not exactly. When I was in telco, the place I had this problem was in FPGAs. We had all ECC memory, and I never linked any problems to bit flips in RAM. As I remember it, the FPGAs we had stored their programming in a type of SRAM cell, and because that's not a memory module it wasn't covered by ECC, so the FPGA programming could bit flip. So the product had a checksum function that read back the program on a cycle and reset itself if the program no longer matched the checksum. We would see 1-2 crashes/restarts per week in our FPGAs that we believe were bit flips.

We then ran an analysis on any of these that had higher-than-expected error rates to try to identify actually bad hardware and replace it.
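
A minimal sketch of that readback check, assuming CRC32 as the checksum (the product's actual checksum isn't stated). `config_is_intact`, `monitor_cycle`, and the simulated configuration image are hypothetical names made up for illustration, not any vendor's API:

```python
import zlib

def config_is_intact(read_back, golden_crc):
    """True if the configuration read back still matches the stored checksum."""
    return zlib.crc32(read_back) == golden_crc

def monitor_cycle(read_back, golden_crc, reset):
    """One pass of the periodic check: call reset() on a mismatch.

    Returns True if a reset was triggered.
    """
    if config_is_intact(read_back, golden_crc):
        return False
    reset()
    return True

# Simulated configuration image (stand-in for the FPGA program).
golden = bytes(range(256)) * 4
golden_crc = zlib.crc32(golden)

# Simulated single-bit upset in the configuration SRAM.
upset = bytearray(golden)
upset[100] ^= 0x08

resets = []
monitor_cycle(golden, golden_crc, lambda: resets.append("reset"))        # clean pass
monitor_cycle(bytes(upset), golden_crc, lambda: resets.append("reset"))  # upset pass
print(len(resets))  # only the pass that saw the flipped bit triggers a reset
```

This only detects the flip and restarts; it cannot say which bit changed, which is why the whole board rebooted.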

I think the vendor eventually came up with a way to reprogram the FPGA without just crashing and rebooting the entire board.

Many modern FPGAs now include dedicated logic for config SRAM "scrubbing." This logic continuously checks config frame checksums to identify upsets. These can then be fixed in real-time either using the error correction properties of the checksum technique, or from the non-volatile config memory (typically NOR flash). It's also important to note that only a subset of the SRAM config bits are critical for a given application. Usually this is a small percentage of the overall array.
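
A rough software sketch of the scrubbing idea, assuming a per-frame CRC32 and repair from a golden copy (standing in for the non-volatile config memory). The frame size, `split_frames`, and `scrub` are hypothetical choices for illustration. Real devices do this in dedicated logic and may correct in place using the checksum's error-correcting properties, which this sketch does not attempt:

```python
import zlib

FRAME_SIZE = 64  # bytes per frame; an arbitrary value for this sketch

def split_frames(image, size=FRAME_SIZE):
    """Split a configuration image into mutable fixed-size frames."""
    return [bytearray(image[i:i + size]) for i in range(0, len(image), size)]

def scrub(frames, frame_crcs, golden_frames):
    """Check every frame and rewrite any mismatching frame from the golden copy.

    Returns the indices of the frames that were repaired.
    """
    repaired = []
    for i, frame in enumerate(frames):
        if zlib.crc32(frame) != frame_crcs[i]:
            frame[:] = golden_frames[i]  # restore only the damaged frame
            repaired.append(i)
    return repaired

# Golden image (stand-in for the copy held in flash) and its frame checksums.
golden_image = bytes((i * 7) % 256 for i in range(FRAME_SIZE * 8))
golden_frames = split_frames(golden_image)
frame_crcs = [zlib.crc32(f) for f in golden_frames]

# Live configuration with one simulated upset in frame 3.
live = split_frames(golden_image)
live[3][10] ^= 0x01

first_pass = scrub(live, frame_crcs, golden_frames)
print(first_pass)             # indices of repaired frames
print(live == golden_frames)  # whether the live config matches the golden copy again
```

Because only the damaged frame is rewritten, the rest of the design keeps running, unlike a full reset.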

https://www.xilinx.com/support/documentation/application_not...

If even higher levels of reliability are needed, there are rad-hard-by-design FPGA families (e.g. Xilinx Virtex 5QV). These have a special config SRAM cell that has more charge storage sites than a conventional SRAM cell. It is less area efficient than a conventional SRAM cell, but the geometry of the charge storage sites ensures that a single cosmic ray can't flip the state of a majority of them. Essentially the cell can self-correct, no scrubbing required.

Interesting, and makes sense. Do you have any additional references you would suggest, particularly in the context of FPGAs?

Would you say this quick reference is a good overview? https://www.intel.com/content/dam/www/programmable/us/en/pdf...

Sorry, I should have mentioned this was quite a few years ago, so I'm very out of date and don't have any known good references handy. That link you shared seems pretty good on a quick scan through and in line with what I remember. I'm pretty sure I dug up similar resources for other vendors, including one that I think was looking at satellite hardware.
Compaq/HP handled this with many redundant resources / cores: https://en.wikipedia.org/wiki/Tandem_Computers
Polyethylene is supposedly good at blocking cosmic rays. It would be funny to me if the fix was just a drop ceiling full of old grocery bags.
Have a source? That seems like an awesome way to recycle grocery bags.
Water is also a good protector. And the polyethylene is actually a proxy for "hydrogen atoms". The reason is that a hydrogen-dense material has a lot of protons to interact with the incoming radiation.

But unfortunately plastic bags are not dense polyethylene. You would kind of need full blocks of solid polyethylene...

You can make solid polyethylene out of plastic bags by pressing them in a mold heated to ~100-150C.

But it's quite flammable, so you might not want to use it as a building material.

The peer comment posted a source that has lots of references.

Though most of them are tests in space, where I assume the thickness requirements would rule out grocery bags. I am curious how thick a layer of HDPE you would need on earth to make any notable difference.

That linked Nature paper seems to indicate 10 g/cm^2 for a 50% reduction. A standard shopping bag is about 5 g, so you would need roughly 20k bags/m^2, or ~2k bags/ft^2. At 0.9 g/cm^3, that would be roughly 11 cm of solid polyethylene.
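
The arithmetic above, spelled out as a quick check. All three inputs are the rough figures from the comment, not measured values:

```python
AREAL_DENSITY_G_PER_CM2 = 10.0  # figure quoted above for a 50% reduction
BAG_MASS_G = 5.0                # rough mass of one shopping bag
PE_DENSITY_G_PER_CM3 = 0.9      # polyethylene density used above

CM2_PER_M2 = 100 * 100
CM2_PER_FT2 = 30.48 ** 2        # 1 ft = 30.48 cm

# Mass needed per unit area, divided by the mass of one bag.
bags_per_m2 = AREAL_DENSITY_G_PER_CM2 * CM2_PER_M2 / BAG_MASS_G
bags_per_ft2 = AREAL_DENSITY_G_PER_CM2 * CM2_PER_FT2 / BAG_MASS_G

# Areal density divided by volumetric density gives thickness.
thickness_cm = AREAL_DENSITY_G_PER_CM2 / PE_DENSITY_G_PER_CM3

print(f"{bags_per_m2:.0f} bags/m^2")
print(f"{bags_per_ft2:.0f} bags/ft^2")
print(f"{thickness_cm:.1f} cm of solid polyethylene")
```

This gives 20,000 bags/m^2, a bit under 1,900 bags/ft^2, and about 11.1 cm, so the rounded numbers hold up.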
I just squished a plastic bag as much as I could, and got it down to an inch cubed. So if I hypothetically did that for 2000 bags in 1ft^2 (sorry for English units), I think that would mean it could be 13-14in thick. So maybe it is reasonable to attenuate radiation by about 50% in a rather generous drop ceiling?
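
A quick check of that estimate, using the figures from this comment plus the ~5 g per bag from the parent (both are rough guesses, not measurements):

```python
BAGS_PER_FT2 = 2000     # rounded bag count from the parent comment
SQUISHED_BAG_IN3 = 1.0  # volume of one hand-squished bag, as estimated above
BAG_MASS_G = 5.0        # rough bag mass from the parent comment

IN2_PER_FT2 = 12 * 12
CM3_PER_IN3 = 2.54 ** 3

# Total squished volume over one square foot, divided by that area.
thickness_in = BAGS_PER_FT2 * SQUISHED_BAG_IN3 / IN2_PER_FT2

# Implied density of a squished bag, to compare with 0.9 g/cm^3 for solid PE.
squished_density = BAG_MASS_G / (SQUISHED_BAG_IN3 * CM3_PER_IN3)

print(f"{thickness_in:.1f} in thick")
print(f"{squished_density:.2f} g/cm^3")
```

That comes out just under 14 in, with the squished bags at roughly a third of the density of solid polyethylene, which is consistent with the layer being about three times thicker than the 11 cm solid figure.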
Ah, so bags are out, as is practically the ceiling. But cut sheet HDPE is easy to find, so tiles atop your machine would be relatively cheap and easy.
I work at a particle accelerator, and I can confirm that our beam dump uses polyethylene for neutron shielding: https://www.sciencedirect.com/science/article/pii/S092037961...
Doesn't sound like that solution is plenum-rated.
Nice, I'm going to line the rooftops with Tyvek now.
5950 failures per billion hours is approximately equal to 1 failure every 19 years.
And if you have a datacenter with 200 machines, that's roughly once a month.
Each machine is going to have way more than 1 DRAM chip. A 128GB DDR4 stick is going to have ~20 chips on one side (not sure if they're double sided, just looking at product listing), and you're going to have terabytes of RAM per machine.
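
To put numbers on the last few comments, here is a small sketch that converts the quoted FIT rate into time between errors. The 20 chips per machine is only a lower-bound assumption taken from the "~20 chips" estimate for a single stick; real machines will vary:

```python
HOURS_PER_YEAR = 24 * 365.25
FIT_PER_CHIP = 5950  # failures per billion device-hours, from the quote above

def mean_hours_between_errors(fit, chips=1):
    """Mean time between soft errors across `chips` devices, in hours."""
    return 1e9 / (fit * chips)

# One DRAM chip.
years_one_chip = mean_hours_between_errors(FIT_PER_CHIP) / HOURS_PER_YEAR

# 200 machines, counted as one chip each (as in the comment above).
days_200_chips = mean_hours_between_errors(FIT_PER_CHIP, 200) / 24

# 200 machines with 20 chips each (assumed: a single stick per machine).
hours_4000_chips = mean_hours_between_errors(FIT_PER_CHIP, 200 * 20)

print(f"{years_one_chip:.1f} years for one chip")
print(f"{days_200_chips:.0f} days for 200 chips")
print(f"{hours_4000_chips:.0f} hours for 4000 chips")
```

With those assumptions, one chip averages about 19 years between errors, 200 chips about 35 days, and 4,000 chips about 42 hours, so the per-machine chip count dominates the result.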
That's a bit more than just 'go underground'. I've seen people dig for cables but I've never seen them dig a hole 15m deep.
A 'Faraday Cage' should also help.

edit: No idea why the downvotes. Faraday cages have long been thought of as a way to protect electrical devices from the electromagnetic waves which can be a result of solar flares.

I even chatted to someone from NASA’s Solar Dynamics Observatory about it in the past.

Cosmic rays are not electromagnetic waves. They are highly energetic particles, like protons and naked helium nuclei.
Because a Faraday Cage would make it worse, not better:

https://www.reddit.com/r/askscience/comments/1fsiv9/how_effe...