by cpgxiii 2023 days ago
The simple answer is that you need at least three computers to identify and recover from a single failure, five for two simultaneous failures, and so on, i.e. 2f+1 computers to outvote f faulty ones (generally assuming failures can be recovered from automatically by rebooting the failed controller). Depending on the planned exposure, you can estimate the probability of upset events, and thus the likelihood of multiple failures within the failure->reboot time interval, and pick the number of computers accordingly. Radiation exposure depends on altitude: in low-Earth orbits outside of the Van Allen belts it is fairly low thanks to protection from the Earth's magnetic field, while trips to other planets need more hardening (via shielding, significantly greater redundancy, or rad-hard circuit design).
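To make the sizing argument concrete, here is a rough Python sketch. It assumes upsets are independent and arrive at a constant rate per computer (a Poisson model), and that a majority vote is defeated once more than half the computers are upset inside the same failure->reboot window. The function names and the example numbers are made up for illustration, not taken from any real mission.

```python
import math


def computers_needed(tolerated_failures: int) -> int:
    """Majority voting needs 2f + 1 computers to outvote f faulty ones."""
    return 2 * tolerated_failures + 1


def p_upset_in_window(upset_rate_per_hour: float, window_hours: float) -> float:
    """Probability that one computer is upset at least once during the window
    (assumes a constant upset rate, i.e. exponential inter-arrival times)."""
    return 1.0 - math.exp(-upset_rate_per_hour * window_hours)


def p_voting_defeated(n: int, upset_rate_per_hour: float, window_hours: float) -> float:
    """Probability that more than (n - 1) // 2 of n computers are upset in the
    same window, assuming independent upsets (binomial tail)."""
    p = p_upset_in_window(upset_rate_per_hour, window_hours)
    max_tolerated = (n - 1) // 2
    return sum(
        math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(max_tolerated + 1, n + 1)
    )


def pick_computer_count(upset_rate_per_hour, window_hours, target, max_n=15):
    """Smallest odd n whose per-window defeat probability is <= target,
    or None if no n up to max_n is good enough."""
    for n in range(3, max_n + 1, 2):
        if p_voting_defeated(n, upset_rate_per_hour, window_hours) <= target:
            return n
    return None


if __name__ == "__main__":
    # Hypothetical numbers: 1e-3 upsets/hour per computer, 36 s reboot window.
    print(pick_computer_count(1e-3, 0.01, target=1e-12))
```

Note that this only gives the probability per window; over a whole mission you would also have to account for how many such windows there are, and for correlated upsets, which this model ignores.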

The most difficult part, historically, is ensuring there is no single point of failure in a redundant system. Put three computers on a single bus, and it's likely that any one of the three bus transceivers could cause a complete system failure (so you've tripled the failure rate). In some systems, like aircraft FBW, each of the controllers has its own actuator and its own connection to it. The computers are cross-connected so they can detect when another has failed, but as a fallback the control surface and actuators are designed so that two good actuators can physically overpower a bad one, which ensures that the mechanical coupling doesn't become the failure point.
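Both points can be put into numbers with a small Python sketch. It assumes each transceiver fails independently at a constant rate and that any one failing takes down the shared bus, and it models the actuators as simple force sources, each limited to the same maximum force, whose outputs add at the control surface. The names and figures are hypothetical and only illustrate the idea.

```python
def shared_bus_failure_rate(n_transceivers: int, transceiver_rate: float) -> float:
    """If any single transceiver can take down the shared bus, the units are in
    series for reliability purposes and their constant failure rates add."""
    return n_transceivers * transceiver_rate


def net_surface_force(commanded_forces, max_force: float) -> float:
    """Sum the actuator forces on one control surface, with each actuator
    clamped to +/- max_force (a deliberately crude force-summing model)."""
    return sum(max(-max_force, min(max_force, f)) for f in commanded_forces)


if __name__ == "__main__":
    # Three transceivers on one bus: three times the single-unit failure rate.
    print(shared_bus_failure_rate(3, 1e-6))

    # Two good actuators push with full force one way, one failed actuator
    # pushes with full force the other way: the good pair still wins.
    print(net_surface_force([1.0, 1.0, -1.0], max_force=1.0))
```

In this model the margin left after overpowering a hard-over actuator is one actuator's worth of force, so each actuator has to be sized to move the surface on its own.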

1 comment

Thanks, this is really interesting. It makes sense how you'd calculate the number of processors you need from the reboot time and the upset frequency. Really appreciate your answers! :)