| > This is tangential, but "Cosmic rays do not cause bit flips at a significant rate" is the default position
No, it's not. It is well accepted in the silicon industry that cosmic ray events (specifically the secondary particles they produce) are responsible for enough bit flips that designers have to build robust circuits and logic to achieve the desired failure rates.
https://www.microsemi.com/document-portal/doc_view/130760-ne...
14. Are radiation effects at ground-level just a theoretical problem?
No, based on FIT rate data from Xilinx UG116, the largest Virtex®-6 device (XC6VLX760) with 184,823,072 configuration bits will have a nominal failures-in-time (FIT) rate of 176 at sea-level in New York. While this represents a mean time between failures (MTBF) of 648 years, a system comprised of 1,000 FPGAs would experience a failure every year. The same systems based in Denver would experience failures every few months.
15. Are there any widely reported incidents of errors due to charged particles?
Several incidents across many industries have been reported in recent years. Among these:
• In 2008, a Qantas Airbus A330-303 pitched downward twice in rapid succession, diving first 650 feet and then 400 feet, seriously injuring a flight attendant and 11 passengers. The cause has been traced to errors in an on-board computer suspected to have been induced by cosmic rays. Modifications were undertaken to mitigate such errors in the future.
• Canadian-based St. Jude Medical issued an advisory to doctors in 2005, warning that SEUs to the memory of its implantable cardiac defibrillators could cause excessive drain on the unit's battery.
• Cisco Systems issued a field notice in 2003 regarding its 1200 series router line cards. The notice warned of line card resets resulting from SEUs.
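As a rough check on the arithmetic in item 14 above, here is a small Python sketch that converts a FIT rate into an MTBF. It assumes the usual definition of FIT as failures per 10^9 device-hours and an 8,760-hour year; the function name `mtbf_years` is made up for illustration and does not come from the linked documents.

```python
HOURS_PER_YEAR = 24 * 365        # 8,760 hours; leap years ignored (assumption)
FIT_DEVICE_HOURS = 1_000_000_000  # FIT = failures per 10^9 device-hours


def mtbf_years(fit: float, devices: int = 1) -> float:
    """Mean time between failures, in years, across `devices` identical parts."""
    failures_per_hour = fit * devices / FIT_DEVICE_HOURS
    return 1 / failures_per_hour / HOURS_PER_YEAR


single_device = mtbf_years(176)               # one XC6VLX760 at 176 FIT
fleet_of_1000 = mtbf_years(176, devices=1000)  # a system of 1,000 such FPGAs

print(f"one device:    {single_device:.0f} years between failures")
print(f"1,000 devices: {fleet_of_1000 * 12:.1f} months between failures")
```

Under these assumptions a single device comes out at roughly 648 years, matching the quoted figure, while a fleet of 1,000 comes out closer to one failure every eight months than one per year.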
https://www.intel.com/content/www/us/en/support/programmable... Unavoidable atmospheric neutrons remain the primary cause for SEU effects today.
https://www.asminternational.org/documents/10192/26583572/ed... One of the most important reliability concerns for silicon circuits is soft errors in SRAM circuits, which involves electrical upsets generated by the interaction of energetic atomic and subatomic particles with the silicon substrate material. SRAMs are particularly sensitive to radiation-induced soft errors due to the relatively low amount of charge at the storage nodes. Errors are generated by the impact of alpha particles emitted from trace amounts of uranium in solder and packaging materials of the circuit, and by neutrons that originate in the cosmic ray shower in the Earth’s atmosphere.
https://www.reliablemicrosystems.com/wp-content/uploads/2021... Even in the absence of on-chip sources of radiation, recent studies have conclusively proved that terrestrial cosmic rays (primarily neutrons) are a significant source of soft errors in both DRAMs and SRAMs [169-171]. Upsets have been observed both at ground level and in aircraft and have been convincingly correlated to the altitude and latitude variation of the neutron flux [172,169,171]. Lage, et al. have shown that even without alpha particles, a baseline of cosmic-ray upsets still exists for high-density SRAMs [170]. O’Gorman has shown that neutron upsets disappear for DRAMs placed 200 meters underground in a salt mine, while they increase dramatically for systems operated above 10,000 feet in Leadville, CO [169].
> and does not need supporting evidence.
Even if we did not already have a large body of evidence to show that it was a concern, that would be untrue. "Cosmic rays do not cause bit flips at a significant rate" is no less a claim than "cosmic rays do cause bit flips at a significant rate", and would require no less evidence. It does not somehow become the "default" just because it predicts little or no interaction. |
https://static.googleusercontent.com/media/research.google.c...
The point OP makes is that the more complicated a claim is, the more evidence it requires. More mundane sources of error would seem to be more likely, and thus more common, causes of bit flips.
Thus the hypothesis that cosmic rays are the dominant cause requires more evidence than any of the alternatives. We know that empirically there’s ~1 bug in every 1k lines of code, or 1 in 10k if you have very good tests. Bugs that look like bit flips are probably less common, so let’s guess and say 1 in 10 million lines. There are ~30 million lines of code in the Linux kernel, and probably a similar amount of userspace code (e.g. Firefox is around 20 million lines). Then think about the Verilog that backs HW designs. I don’t know the size of those codebases so I can’t estimate, but it feels like bit flip bugs are possible there too. Then you’ve got to actually synthesize that digital logic and implement it in analog space. Components could easily be driven out of spec electrically (whether by accident, manufacturing defect, or swapping in lower-cost components), and bit flips would be a comparatively common type of error when shuttling data around, especially across high-bandwidth links that aren’t error-checked.
The point is, the combined probability of all these sources of error seems higher than the probability of true cosmic rays being behind bit flips. The Google paper is just more evidence of this. I’m sure that if you measure specifically for cosmic rays you’ll be able to see their impact. In a production deployment running at scale, on variable-quality hardware and arbitrary software versions, all the other sources of error would seem like more likely first-order effects that would swamp any ability to detect cosmic rays. That’s not to say Mozilla hasn’t accounted for it. Just that OP’s position is the sensible default to start from (i.e. Occam’s razor).
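To make the back-of-envelope estimate above concrete, here is a quick Python sketch. Every number in it is one of the guesses from this comment (bug densities and line counts), not a measurement, and the names `LINES_PER_BUG`, `LINES_OF_CODE` and `expected_bugs` are just for illustration.

```python
# Guessed bug densities from the comment, expressed as "one bug per N lines".
LINES_PER_BUG = {
    "any bug": 1_000,
    "any bug (very good tests)": 10_000,
    "bit-flip-like bug (pure guess)": 10_000_000,
}

# Rough codebase sizes from the comment, in lines of code.
LINES_OF_CODE = {
    "Linux kernel": 30_000_000,
    "Firefox": 20_000_000,
}


def expected_bugs(lines: int, lines_per_bug: int) -> float:
    """Expected number of latent bugs, assuming a uniform density."""
    return lines / lines_per_bug


for project, lines in LINES_OF_CODE.items():
    for kind, density in LINES_PER_BUG.items():
        count = expected_bugs(lines, density)
        print(f"{project}: ~{count:,.0f} x {kind}")
```

Even at the guessed 1-in-10-million rate, that leaves a handful of latent bit-flip-like bugs per large codebase, before counting hardware design or electrical faults.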