|
They have to do two things :-)
1) Use ECC memory
2) Go underground
"One experiment measured the soft error rate at the sea level to be 5,950 failures in time (FIT = failures per billion hours) per DRAM chip. When the same test setup was moved to an underground vault, shielded by over 50 feet (15 m) of rock that effectively eliminated all cosmic rays, zero soft errors were recorded.[6] In this test, all other causes of soft errors are too small to be measured, compared to the error rate caused by cosmic rays." "Soft Errors"
https://en.wikipedia.org/wiki/Soft_error#Cosmic_rays_creatin... |
Not exactly. When I was in telco, the place I had this problem was in FPGAs. All of our memory was ECC, and I never linked any problems to bit flips in RAM. But as I remember it, the FPGAs we had held their programming in a type of SRAM cell, and because that's not a memory module it wasn't covered by the ECC, so the FPGA programming itself could bit flip. So the product had a checksum function that read back the program on a cycle and reset itself if the program no longer matched the checksum. We would see 1-2 crashes/restarts per week in our FPGAs that we believe were bit flips.
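To make the readback check concrete, here is a rough sketch of the idea in Python. It is a simulation, not the vendor's mechanism: `ConfigMemory`, `flip_bit`, and `scrub` are hypothetical names, and CRC-32 is an assumed stand-in for whatever checksum the product actually used.

```python
import zlib


def checksum(bitstream: bytes) -> int:
    # CRC-32 is an assumption; the real product's checksum is not known.
    return zlib.crc32(bitstream)


class ConfigMemory:
    """Hypothetical stand-in for SRAM-based FPGA configuration memory."""

    def __init__(self, golden: bytes):
        self.golden = golden               # known-good program image
        self.expected = checksum(golden)   # checksum recorded at load time
        self.live = bytearray(golden)      # what the SRAM cells hold now
        self.resets = 0

    def flip_bit(self, bit_index: int) -> None:
        # Simulate a single-event upset in one configuration bit.
        self.live[bit_index // 8] ^= 1 << (bit_index % 8)

    def scrub(self) -> bool:
        """One readback cycle. Returns True if a mismatch forced a reload."""
        if checksum(bytes(self.live)) == self.expected:
            return False
        # Mismatch: reload the golden image (the "reset itself" step).
        self.live = bytearray(self.golden)
        self.resets += 1
        return True
```

Calling `scrub()` on a timer, then calling `flip_bit(1000)` and `scrub()` again, shows one reset being counted and the image restored. In the real product that reset took the whole board down, which is why it showed up as a crash/restart.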
We then ran an analysis on any of these that showed higher-than-expected error rates, to try to identify actually bad hardware and replace it.
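I don't remember exactly how that analysis was done, so treat the following as one plausible way to do it rather than what we ran. It assumes restarts on a healthy board follow a Poisson distribution at some baseline rate, and flags boards whose count would be very unlikely under that baseline. The function names, the board names, the 1.5/week baseline, and the `alpha` cutoff are all made up for illustration.

```python
import math


def poisson_tail(k: int, mean: float) -> float:
    """P(X >= k) for X ~ Poisson(mean)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def suspect_boards(
    restarts: dict[str, int],
    weeks: float,
    baseline_per_week: float,
    alpha: float = 0.001,
) -> list[str]:
    """Boards whose restart count is improbably high for the baseline rate."""
    mean = baseline_per_week * weeks  # expected restarts for a healthy board
    return sorted(
        board for board, count in restarts.items()
        if poisson_tail(count, mean) < alpha
    )


# Hypothetical counts over 4 weeks, assumed baseline of 1.5 restarts/week.
observed = {"board-a": 5, "board-b": 7, "board-c": 21}
flagged = suspect_boards(observed, weeks=4, baseline_per_week=1.5)
```

With these made-up numbers only `board-c` gets flagged. The other two are within what random bit flips alone would produce, so swapping them out would not be expected to help.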
I think the vendor eventually came up with a way to reprogram the FPGA without just crashing and rebooting the entire board.