by explanibrag 3703 days ago
I once read that NASA control engineers have three independent teams code up three versions of their guidance systems. If the systems disagree, they go with the majority vote.
5 comments

The jargon for this practice is multiversion programming. It's based on the idea that different people will make different errors when implementing a design. However, in practice we find that errors are actually moderately or even strongly correlated between different programmers. So this practice is rather uncommon.
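The majority vote described in the original post can be sketched roughly as follows. This is only an illustration: the three `version_*` functions are hypothetical stand-ins for independently written implementations of one spec, and `version_c` is deliberately buggy for a single input.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value reported by a strict majority of versions, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

# Hypothetical, independently written implementations of "double the input".
def version_a(x):
    return x * 2

def version_b(x):
    return x + x

def version_c(x):
    # Deliberately wrong for x == 7, to show the voter masking one bad version.
    return x * 2 + (1 if x == 7 else 0)

def voted(x):
    """Run all three versions and return the majority answer (or None)."""
    return majority_vote([v(x) for v in (version_a, version_b, version_c)])

print(voted(7))
print(majority_vote([1, 2, 3]))
```

Note that the voter only helps when the versions fail on different inputs. If two versions share the same bug, the wrong answer wins the vote, which is the correlation problem mentioned above.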
It's one of the techniques I used to recommend, although for anti-subversion instead of safety. I was worried that what you said might turn out to be true. My solution and hypothesis is that using three very different languages should counter that effect. Hard to imagine the same error happening in PreScheme, SPARK, and C.
IIRC, the Space Shuttle had four identical computers: three ran the same (full-featured) software and detected errors with a voting algorithm; the fourth ran different (minimally-featured) software developed by a separate team as a final backup.

The fourth computer's software was only sufficient to abort the mission and return the shuttle to the Earth. IIRC, they never had to use it.

"Every ship has three AIs. Due to the radiation and interference, all three suffer lapses in sanity. They're encased in lead, but the sensors are mostly wide open to radiation. Often, they 'hallucinate' from observing the outside world with faulty sensors, so they often vote as to whether or not an input is actually real.

"A fourth AI is purposefully kept dormant. He's a bit single-minded and his only purpose is to find a planet and touch down. If the other three ever can't agree or have a moment of clarity in which they realize they've become unstable, they deactivate and activate him. He regularly reloads from scratch, forgetting his previous incarnations, and spends a majority of his time validating the coordinates his previous incarnation left for him, making course adjustments, and ensuring the humans don't prevent him from saving them..."

I would read that book...

Maybe you should write it instead. Then we can read it. :)
Interesting to compare Frank Herbert's Destination: Void:

The crew are just caretakers: the ship is controlled by a disembodied human brain, called "Organic Mental Core" or "OMC", that runs the complex operations of the vessel and keeps it moving in space. But the first two OMC's (Myrtle and Little Joe) become catatonic, while the third OMC goes insane and kills two of the umbilicus crew members. The crew are left with only one choice: to build an artificial consciousness that will enable the ship to continue. The crew knows that if they attempt to turn back they will be ordered to abort (self destruct).

https://en.wikipedia.org/wiki/Destination:_Void

There's an old saying that goes something like, "When going to sea, take one clock or three—never two." The idea being if you bring two and they disagree, you won't know which one is right.
Every Airbus since the A320 uses the same system in its fly-by-wire design as well, plus several fallback mechanisms where other computers can take over a failed computer's work or augment it. For example, the ELAC controls the ailerons and the SEC controls the spoilers. If the ELAC fails, the SEC can take over and provide roll control via the spoilers, although it is limited (and those changes also come with a change to the plane's flight envelope).
I can't remember if this is true of Airbii or if it's Boeings I'm thinking of, but I remember reading a while back that the three microcontrollers they run on are also from different manufacturers.
No idea about Boeing, but yes for Airbus. Every "computer" is actually two computers: one is the active computer (COM, command) and one is the inactive computer (MON, monitoring). Both still perform the same calculations based on the same input; however, they use different software and hardware. There is a watchdog in between the two that verifies the results against each other, just in case there is a bug in the hardware or software. Then you also have multiples of these computers, e.g. there are two ELACs and three SECs. The ELACs and SECs are fed data from a different air data inertial reference unit (ADIRU) and use a different hydraulic line to actuate the flight controls. And lastly, the results the ELACs and SECs come up with also have to agree with each other or the result is thrown out.

All of that redundancy makes it possible to build some really robust flight envelope systems that keep the airplane within safe margins.

* I should note that all of this applies to the A320 family; the systems have been developed even further in recent years. For example, with the A350, Airbus made some steps towards allowing the flight computers to be used in simulators, so that the same software/hardware as on the real plane can be used.
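A rough sketch of the COM/MON idea described above, for illustration only. It is not Airbus's actual logic: the control law, the tolerance, the unit names in the usage lines, and the "first healthy unit wins" selection are all assumptions made up for the example.

```python
def com_channel(stick_input):
    # Hypothetical command law: scale stick deflection to a surface angle.
    return round(stick_input * 25.0, 3)

def mon_channel(stick_input):
    # Same law, written separately (stands in for dissimilar software/hardware).
    return round(25.0 * stick_input, 3)

def command_monitor(stick_input, com=com_channel, mon=mon_channel, tolerance=0.01):
    """Return (command, healthy). The command is dropped if COM and MON disagree."""
    c = com(stick_input)
    m = mon(stick_input)
    if abs(c - m) <= tolerance:
        return c, True
    return None, False

def first_healthy(units, stick_input):
    """Walk units in priority order and use the first whose COM/MON pair agrees."""
    for name, com, mon in units:
        command, healthy = command_monitor(stick_input, com, mon)
        if healthy:
            return name, command
    return None, None  # no unit agrees: caller must fall back to a safe state

# Usage: the first unit has a faulty COM channel, so the second takes over.
faulty_com = lambda stick_input: 99.0
units = [
    ("ELAC1", faulty_com, mon_channel),
    ("ELAC2", com_channel, mon_channel),
]
print(first_healthy(units, 0.5))
```

The point of the pairing is that a disagreement takes the unit out of the loop instead of letting either channel guess which one is right.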

I'm a tad late, but I'm curious about where you said

> they use different software and hardware

Does this mean different architectures? If it does, my respect for the redundant-hardware approach just went through the roof.

Also, how is the watchdog redundant? I can't imagine there's only one; how does this work? Are both watchdogs somehow wired in parallel, are they cross-connected to each other, or...?

It literally means two completely different architectures, on two physically disconnected computers. Here is a diagram: https://i.imgur.com/Tj0GKbQ.png

Also visible in the diagram is that each side has its own watchdog, both connected to each other. The way this whole thing works is fail-safe, so if one computer fails the backup can jump in, and if that fails too, the flight controls will either retract or stop in their current position depending on what makes the most sense. It’s also mirrored, so for example if spoiler 2 on the right wing fails and is retracted, spoiler 2 on the left wing will also retract.

Here is a description about the flight controls and how pilot input gets passed through to the control surfaces: http://www.smartcockpit.com/docs/A320-Flight_Controls.pdf

And here is a general overview about the architecture: http://www.skybrary.aero/bookshelf/books/2313.pdf

Oh, wow, that's amazing. Now I understand why avionics are so expensive - verifying the correctness of such a system sounds like a lot of "fun," or at least a lot of time.

(I wonder if there are any systems built on multiple architectures where each unit is itself a redundant system with CPUs in lockstep.....)

Am I to intuit from this diagram that the watchdog watches all the components - power, I/O, memory, and CPU? That's very impressive. Or does it watch a central bus/backplane everything is connected to?

Also, how does either side decide/figure out the other side has failed? Simply deciding that the other half is wrong if it doesn't match this half's output could fail catastrophically if one of the sides reaches this conclusion after entering an invalid state (i.e., it's the other side that is correct, and this side is wrong).

I'm also mildly curious as to why the I/O on one side has two connections to the actuators, while the other has only one.

What do they do if all three disagree?
That will never happen. It would take the simultaneous failure of two independent systems, each of which is highly reliable. If the MTBF (mean time between failures) of one component were, let's say, ten years of continuous use, then they each have a 1/87600 chance of failing in each usage-hour.

The odds of a simultaneous (within one hour) double failure are the square of that, or 1/7673760000 per hour. This corresponds to an MTBF of roughly 3836880000 hours or 438000 years.
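The arithmetic above can be checked with a few lines. This is a sketch under the comment's own assumptions (a ten-year MTBF, independent failures, one-hour windows); the variable names are made up for the example. It also shows that the quoted 3836880000-hour figure is half of 87600 squared, without taking a position on where that factor of two comes from.

```python
from fractions import Fraction

HOURS_PER_YEAR = 24 * 365                 # 8760
mtbf_hours = 10 * HOURS_PER_YEAR          # 87600 hours for a ten-year MTBF

p_single = Fraction(1, mtbf_hours)        # chance one unit fails in a given hour
p_double = p_single ** 2                  # both fail in the same hour, if independent
print(p_double)                           # 1/7673760000

quoted_mtbf_hours = 3836880000            # figure quoted in the comment
print(quoted_mtbf_hours / HOURS_PER_YEAR) # 438000.0 years

ratio = Fraction(quoted_mtbf_hours) * p_double
print(ratio)                              # 1/2
```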

1000 of the same model flying 12 hours a day (utilisation is nearer 11 [1]; some airliners were made in numbers a lot more than 1000, e.g. over 8000 737s [2]) for 20 years is 10000 flight-years (very crudely).

438000/10000 is 43.8, so roughly a 1-in-43.8, or 2.3%, chance over 20 years.

It's a longshot but it's not never.

[1] http://web.mit.edu/airlinedata/www/2014%2012%20Month%20Docum...

[2] https://en.wikipedia.org/wiki/List_of_most-produced_aircraft

So yes the odds of any one plane experiencing that problem are absolutely tiny but across the fleet not so much.
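The fleet estimate above can be reproduced as a quick sketch. It takes the 438000-year figure as given and uses the same crude inputs (1000 aircraft, 12 hours a day, 20 years). The Poisson line is an added assumption, included only to show that it gives about the same answer as the simple ratio.

```python
import math

double_failure_mtbf_years = 438_000   # figure taken from the parent comment
fleet_size = 1000
hours_per_day = 12
service_years = 20

# Each aircraft flies half of every day, so 1000 aircraft over 20 years
# accumulate 10000 years of continuous operation.
flight_years = fleet_size * (hours_per_day / 24) * service_years

expected_events = flight_years / double_failure_mtbf_years
# Assuming events arrive as a Poisson process at that rate:
p_at_least_one = 1 - math.exp(-expected_events)

print(flight_years)
print(round(100 * expected_events, 1))
print(round(100 * p_at_least_one, 1))
```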

But compared to the other failure modes, still extremely unlikely. It's not really worth trying to prevent an error that happens to the whole fleet once every 400 years when you could work on fixing other problems that cause plane crashes much more frequently.
Agreed, but I never said anything about engineering priorities; my observation was that unlikely events happen at scale.
I understand 'wear and tear' failure is extremely unlikely to strike simultaneously within the same flight, but what about intentional disruption? Is that possible or has it been explored?
If the bad guys can disrupt one clock, they can probably disrupt all the clocks you have, whether that's three or twenty.
If there is no mode, take the median for any numeric or otherwise orderable values. For non-orderable values, let the AIs pass around a single "I'm correct this time" token. Multiple simultaneous faults aren't going to be common enough to think up fancy recovery modes for them.
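A minimal sketch of the "mode, else median" rule for orderable values. The function name and the definition of "mode" used here (a value that appears more than once and strictly more often than any other) are assumptions for the example. The token-passing scheme for non-orderable values is not covered.

```python
from collections import Counter
from statistics import median_low

def vote(values):
    """Return the unique mode if one exists, otherwise the (low) median."""
    if not values:
        raise ValueError("need at least one value to vote on")
    counts = Counter(values).most_common()
    top_value, top_count = counts[0]
    # A usable mode appears more than once and beats every other value.
    has_unique_mode = top_count > 1 and (
        len(counts) == 1 or counts[1][1] < top_count
    )
    if has_unique_mode:
        return top_value
    # median_low always returns one of the reported values, never an average.
    return median_low(values)

print(vote([10, 10, 99]))   # two agree, so the mode wins
print(vote([10, 12, 99]))   # all disagree, so the middle value wins
```

With three voters, falling back to the median means a single wildly wrong reading can never be the one selected.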