Hacker News
by robotsteve2 1527 days ago
Any sort of hardware or software error seems much more likely. Computers are incredibly complex and approximations are used everywhere (in the design of the hardware, in the theory of operation). I don't think inference-based experiments or analysis on cosmic ray bit flips are appropriate.

You really need some kind of dedicated cosmic ray detector nearby as a control. If the flux of cosmic rays into the detector is orders of magnitude lower than the rate of bit errors you ascribe to cosmic rays, it's probably some hardware/software issue and not the cosmic rays.
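A rough sketch of that sanity check in Python. The function names, the areas, and all the counts are made-up placeholders for illustration, not measurements. Treating each particle as flipping at most about one bit is also a simplifying assumption.

```python
def events_per_cm2_hour(count, area_cm2, hours):
    """Convert a raw event count into a rate per cm^2 per hour."""
    return count / (area_cm2 * hours)


def flux_can_explain_errors(detector_rate, error_rate, bits_per_particle=1.0):
    """True only if the bit-error rate does not exceed what the measured
    particle flux could plausibly produce (both in events per cm^2 per hour)."""
    return error_rate <= detector_rate * bits_per_particle


# Hypothetical numbers, for illustration only.
detector = events_per_cm2_hour(count=50, area_cm2=100.0, hours=24.0)
errors = events_per_cm2_hour(count=500, area_cm2=1.0, hours=24.0)

# Here the error rate is about 1000x the flux, so this prints False and
# points at a hardware/software cause rather than cosmic rays.
print(flux_can_explain_errors(detector, errors))
```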

3 comments

Indeed, there was a study in IEEE pointing out the absurdity of cosmic rays as causes. One point cited was that the vast majority of bit flips happen at specific points in the address space, essentially page boundaries between chips.

I'm curious why that is evidence against the cosmic ray explanation.

Couldn't it have something to do with the physical layout of memory? Perhaps those page-boundary-adjacent addresses present a larger physical target, perhaps on the bus.

Of course I am wildly speculating right now. I'd love to see the article if you have a link!

Apologies everyone, the paper I was thinking of was from ACM, "Cosmic Rays Don't Strike Twice": https://dl.acm.org/doi/pdf/10.1145/2248487.2150989

The point being that consumer grade memory is straight up error prone, and a bitflip isn't necessarily caused by "cosmic rays" but could just be like, a flaky DRAM chip, network card, etc.

If you are Mozilla, it doesn't really matter what the source of the bit flips is so much as having a good understanding of their prevalence and how they might impact your customer experience, telemetry, and even security.

Modern devices have tiny features that are extremely fragile to any sort of interference, and such interference is much more abundant than cosmic rays.

See the row-hammer attack, where you can flip an unrelated bit just by reads/writes to physically adjacent memory rows from software!!!

This is also why memtest has all these testing patterns, and one reason why you typically want to leave it running overnight (or as long as feasible) instead of "yup, we read and wrote all bits and it's all fine!"
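To illustrate why a single pass of "write everything, read everything" can miss faults, here is a toy simulation in Python. The `FaultyMemory` class and its fault model (one cell that loses a stored 1 only while a neighbour holds 0) are hypothetical and invented for this sketch; this is not how memtest itself is implemented.

```python
class FaultyMemory:
    """Toy bit-addressable memory with one pattern-sensitive cell
    (hypothetical fault model, for illustration only)."""

    def __init__(self, size, victim, aggressor):
        self.cells = [0] * size
        self.victim = victim
        self.aggressor = aggressor

    def write(self, addr, bit):
        self.cells[addr] = bit

    def read(self, addr):
        # The victim loses a stored 1 only while its neighbour holds 0.
        if (addr == self.victim and self.cells[addr] == 1
                and self.cells[self.aggressor] == 0):
            return 0
        return self.cells[addr]


def failing_addresses(mem, pattern):
    """Write the whole pattern, then read it back and list mismatches."""
    for addr, bit in enumerate(pattern):
        mem.write(addr, bit)
    return [addr for addr, bit in enumerate(pattern) if mem.read(addr) != bit]


SIZE = 8
patterns = {
    "all zeros": [0] * SIZE,
    "all ones": [1] * SIZE,
    "checkerboard": [i % 2 for i in range(SIZE)],
    "inverse checkerboard": [1 - i % 2 for i in range(SIZE)],
}
mem = FaultyMemory(SIZE, victim=3, aggressor=4)
results = {name: failing_addresses(mem, p) for name, p in patterns.items()}
print(results)
```

In this toy model the all-zeros and all-ones passes both come back clean, and only the checkerboard pattern exposes the bad cell.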

Row hammer is one of those things that took advantage of something everyone already kind of knew, just never thought about as a security problem.

This is tangential, but "Cosmic rays do not cause bit flips at a significant rate" is the default position and does not need supporting evidence.

It's like if I claimed that planes occasionally crash due to a local density of dark matter pulling them down. Nobody would need to provide evidence against my theory, it's outrageous. I'm the one who has to provide the evidence.

> This is tangential, but "Cosmic rays do not cause bit flips at a significant rate" is the default position

No it's not. It's very well accepted in the silicon industry that cosmic ray events (the secondary particles) are responsible for enough bit flips that they have to be concerned with designing robust circuits and logic to achieve desired failure rates.

https://www.microsemi.com/document-portal/doc_view/130760-ne...

    14. Are radiation effects at ground-level just a theoretical problem?
    No, based on FIT rate data from Xilinx UG116, the largest Virtex ®-6 device (XC6VLX760) with 184,823,072 configuration bits will have a nominal failures-in-time (FIT) rate of 176 at sea-level in New York. While this represents a mean time between failures (MTBF) of 648 years, a system comprised of 1,000 FPGAs would experience a failure every year. The same systems based in Denver would experience failures every few months.

    15. Are there any widely reported incidents of errors due to charged particles?
    Several incidents across many industries have been reported in recent years. Among these:
    • In 2008, a Qantas Airbus A330-303 pitched downward twice in rapid succession, diving first 650 feet and then 400 feet, seriously injuring a flight attendant and 11 passengers. The cause has been traced to errors in an on-board computer suspected to have been induced by cosmic rays. Modifications were undertaken to mitigate such errors in the future.
    • Canadian-based St. Jude Medical issued an advisory to doctors in 2005, warning that SEUs to the memory of its implantable cardiac defibrillators could cause excessive drain on the unit's battery.
    • Cisco Systems issued a field notice in 2003 regarding its 1200 series router line cards. The notice warned of line card resets resulting from SEUs.
https://www.intel.com/content/www/us/en/support/programmable...

     Unavoidable atmospheric neutrons remain the primary cause for SEU effects today.
https://www.asminternational.org/documents/10192/26583572/ed...

     One of the most important reliability concerns for silicon circuits is soft errors in SRAM circuits, which involves electrical upsets generated by the interaction of energetic atomic and subatomic particles with the silicon substrate material. SRAMs are particularly sensitive to radiation-induced soft errors due to the relatively low amount of charge at the storage nodes. Errors are generated by the impact of alpha particles emitted from trace amounts of uranium in solder and packaging materials of the circuit, and by neutrons that originate in the cosmic ray shower in the Earth’s atmosphere.
https://www.reliablemicrosystems.com/wp-content/uploads/2021...

    Even in the absence of on-chip sources of radiation, recent studies have conclusively proved that terrestrial cosmic rays (primarily neutrons) are a significant source of soft errors in both DRAMs and SRAMs [169-171]. Upsets have been observed both at ground level and in aircraft and have been convincingly correlated to the altitude and latitude variation of the neutron flux [172,169,171]. Lage, et al. have shown that even without alpha particles, a baseline of cosmic-ray upsets still exists for high-density SRAMs [170]. O’Gorman has shown that neutron upsets disappear for DRAMs placed 200 meters underground in a salt mine, while they increase dramatically for systems operated above 10,000 feet in Leadville, CO [169].
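The FIT arithmetic in the Microsemi answer quoted above (question 14) is easy to check. A minimal sketch in Python, assuming the usual definition of FIT as failures per 10^9 device-hours and a constant failure rate; the function name is mine, not from any of the linked documents.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def fit_to_mtbf_years(fit, devices=1):
    """Mean time between failures, in years, for `devices` identical parts
    each rated at `fit` failures per 1e9 device-hours."""
    failures_per_hour = fit * devices / 1e9
    return 1.0 / failures_per_hour / HOURS_PER_YEAR


single = fit_to_mtbf_years(176)        # one XC6VLX760 at sea level
fleet = fit_to_mtbf_years(176, 1000)   # a system of 1,000 such FPGAs

print(f"single device: {single:.0f} years between failures")
print(f"1,000 devices: {fleet * 12:.1f} months between failures")
```

This reproduces the quoted MTBF of roughly 648 years for a single device. For 1,000 devices the same arithmetic gives a failure about every eight months, which the quote rounds to "every year".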

> and does not need supporting evidence.

Even if we did not already have a large body of evidence to show that it was a concern, that would be untrue.

"Cosmic rays do not cause bit flips at a significant rate" is no less a claim than "cosmic rays do cause bit flips at a significant rate", and would require no less evidence. It does not somehow become the "default" just because it predicts little or no interaction.

Here’s contradicting evidence to your position:

https://static.googleusercontent.com/media/research.google.c...

The point OP makes is that the more complicated a claim is, the more evidence it requires. More common sources of errors would seem to be more likely and thus more common causes of bit flips.

Thus the cosmic ray hypothesis needs more evidence than the alternatives before it can be called the dominant cause. We know that empirically there’s ~1 bug in every 1k lines of code, or 1 in 10k if you have very good tests. But bit-flip-type errors are probably less common, so let’s guess 1 in 10 million. There are ~30 million lines of code in the Linux kernel. There’s probably a similar amount of userspace code (e.g. Firefox is around 20 million lines). Then think about the Verilog that backs HW designs. I don’t know the size of those codebases well enough to estimate, but it feels like bit flip bugs are possible there. Then you’ve got to actually synthesize that digital logic and implement it in analog space. Components could easily be driven out of spec electrically (whether by accident, manufacturing defect, or swapping in lower cost components), and bit flips would be a comparatively common type of error when shuttling data around, especially across high bandwidth links that aren’t error-checked.

The point is, the combined probability of all these sources of error seems higher than the probability of true cosmic rays being behind bit flips. The Google paper is just more evidence of this. I’m sure that if you measure just for cosmic rays you’ll be able to see their impact. In a production system running at scale, on variable quality hardware and arbitrary software versions, all the other sources of error seem like more likely first order effects that would swamp any ability to detect cosmic rays. Not to say that Mozilla hasn’t accounted for it. Just that OP’s position is the sensible default position to start from (i.e. Occam’s razor).
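For what it's worth, the back-of-envelope defect estimate in this comment works out as follows. A small Python sketch using only the rates and line counts guessed above; the 1-in-10-million figure is the commenter's guess, not a measured value, and the names are illustrative.

```python
def expected_defects(lines_of_code, lines_per_defect):
    """Expected number of defects, given one defect per `lines_per_defect` lines."""
    return lines_of_code / lines_per_defect


# Figures taken from the comment above (rough guesses, not measurements).
lines_per_defect = {
    "any bug": 1_000,
    "any bug, very good tests": 10_000,
    "bit-flip-like bug (guess)": 10_000_000,
}
codebases = {"Linux kernel": 30_000_000, "Firefox": 20_000_000}

estimates = {
    (name, kind): expected_defects(loc, per)
    for name, loc in codebases.items()
    for kind, per in lines_per_defect.items()
}
for (name, kind), n in estimates.items():
    print(f"{name:12s} {kind:28s} ~{n:,.0f}")
```

So even with the pessimistic guess, that's a handful of latent bit-flip-like bugs per large codebase, before counting any hardware design or electrical issues.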

That's not contrasting evidence. Defects are certainly common sources of error, particularly with cheap commodity components like those used in Google's fleet. That does not prove cosmic rays aren't a significant source of SEUs [in any computing device].

I'd be very interested in reading that article if you have a link (or title, or doi...)

I believe people use "cosmic rays" as a catch-all phrase for all these very low probability error causes (just because of the coolness of cosmic rays), but in practice _any_ other cause is much more common than cosmic rays.

Even at the processor level every single transistor on it has a rated mean time between failures a.k.a. MTBF. Sure it may be astronomical, but you do have a lot of transistors, so in practice a random bitflip is not such a rare event. Designers actually explore MTBF vs power usage trade-offs here, and there is even a fascinating area of "fault resilient computing" research.

Every single clock domain crossing has another MTBF (google metastability). Again they are very high (billions of years if done properly), but you will have plenty of such crossings (and the number keeps growing with modern, more asynchronous design).

Processors are quite unreliable things.
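A quick sketch of the "astronomical MTBF, but lots of elements" arithmetic, in Python. It assumes independent elements with constant failure rates (so rates add and MTBF divides by the element count) and uses the common textbook synchronizer model MTBF = exp(t_resolve / tau) / (t_window * f_clk * f_data). All parameter values below are hypothetical, not from any datasheet.

```python
import math


def system_mtbf(element_mtbf, n_elements):
    """MTBF of n independent, identical elements with constant failure
    rates: the rates add, so the MTBF divides by n."""
    return element_mtbf / n_elements


def synchronizer_mtbf_seconds(t_resolve, tau, t_window, f_clk, f_data):
    """Textbook metastability model for one clock domain crossing.
    t_resolve: time allowed for the flop to settle (s)
    tau:       resolution time constant of the flop (s)
    t_window:  metastability capture window (s)
    f_clk, f_data: clock and data toggle frequencies (Hz)"""
    return math.exp(t_resolve / tau) / (t_window * f_clk * f_data)


# Hypothetical: each crossing is good for a billion years,
# but the chip has ten thousand of them.
crossing_mtbf_years = 1e9
chip_mtbf_years = system_mtbf(crossing_mtbf_years, 10_000)
print(f"chip-level MTBF: {chip_mtbf_years:,.0f} years")

# Hypothetical flop parameters: one extra nanosecond of settling time
# improves the per-crossing MTBF by a factor of exp(1e-9 / tau).
short = synchronizer_mtbf_seconds(1e-9, 1e-10, 1e-10, 1e9, 1e8)
long_ = synchronizer_mtbf_seconds(2e-9, 1e-10, 1e-10, 1e9, 1e8)
print(f"improvement from 1 ns more settling: {long_ / short:.3g}x")
```

The exponential term is why "done properly" matters so much: small changes in settling time move the per-crossing MTBF by orders of magnitude, while the number of crossings only scales it linearly.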

Ironically, even though the more modern, "asynchronous" (really, more just asynchronous communication between fully-synchronous clock domains) CPU designs result in more chances for metastability, a fully asynchronous, self-timed design wouldn't have to have any likelihood of metastability at all!

Yes, but what you'd want to do is look for coincidences between a cosmic ray shower detector around (above?) the electronics you're monitoring and whatever it is these days that instruments ECC events. The time resolution would be pathetic for a nuclear physics experiment, but probably good enough.

If you look at the ambient gamma-ray spectrum in a semiconductor detector (which would be germanium rather than silicon) the main background you see is typically from concrete; I'm ashamed to say I've forgotten the energy from K-40, but it's in the region of 1500 keV. (Ironically, large concrete blocks used for shielding would be regarded as a significant radiation hazard if all the activity in them was concentrated.)
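The coincidence search suggested at the start of this comment could look something like the sketch below (Python). It assumes you already have two lists of event timestamps in seconds, one from the shower detector and one from the ECC logging. The function names, the sample timestamps, and the window are all hypothetical.

```python
from bisect import bisect_left, bisect_right


def count_coincidences(detector_times, ecc_times, window_s):
    """Count ECC events that have at least one detector hit
    within +/- window_s seconds."""
    detector_times = sorted(detector_times)
    hits = 0
    for t in ecc_times:
        lo = bisect_left(detector_times, t - window_s)
        hi = bisect_right(detector_times, t + window_s)
        if hi > lo:  # at least one detector timestamp falls in the window
            hits += 1
    return hits


def expected_accidentals(detector_rate_hz, n_ecc_events, window_s):
    """Chance coincidences expected if the two streams are independent.
    Approximation valid when detector_rate_hz * 2 * window_s is small."""
    return n_ecc_events * detector_rate_hz * 2 * window_s


# Made-up timestamps in seconds, for illustration only.
detector = [10.0, 55.2, 120.7, 300.1]
ecc = [55.0, 200.0, 300.5]

observed = count_coincidences(detector, ecc, window_s=0.5)
accidental = expected_accidentals(detector_rate_hz=0.01, n_ecc_events=len(ecc),
                                  window_s=0.5)
print(observed, accidental)
```

The interesting number is how far `observed` sits above `accidental`. With a coarse time resolution the window has to be wide, which raises the accidental rate, so you'd need a long run to say anything.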