by thebestmoshe 901 days ago
Isn’t this basically what SpaceX is doing?

> The flight software is written in C/C++ and runs in the x86 environment. For each calculation/decision, the "flight string" compares the results from both cores. If there is a inconsistency, the string is bad and doesn't send any commands. If both cores return the same response, the string sends the command to the various microcontrollers on the rocket that control things like the engines and grid fins.

https://space.stackexchange.com/a/9446/53026
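
A minimal sketch of the per-string check described in that quote, assuming each core's result can be compared for equality. The names `flight_string`, `healthy_core`, and `faulty_core` are hypothetical and not taken from the actual flight software (which the quote says is C/C++); Python is used here only for brevity.

```python
from typing import Callable, Optional

Command = str


def flight_string(core_a: Callable[[dict], Command],
                  core_b: Callable[[dict], Command],
                  sensors: dict) -> Optional[Command]:
    """Run the same calculation on two cores; emit a command only if they agree."""
    result_a = core_a(sensors)
    result_b = core_b(sensors)
    if result_a != result_b:
        # Inconsistency: the string is "bad" and sends nothing.
        return None
    return result_a


# Hypothetical stand-ins for the two cores.
def healthy_core(sensors: dict) -> Command:
    return "throttle_up" if sensors["altitude_m"] < 1000 else "hold"


def faulty_core(sensors: dict) -> Command:
    # Simulates a corrupted result, e.g. after a bit flip.
    return "hold" if sensors["altitude_m"] < 1000 else "throttle_up"
```

Note that a silent string is only a detection mechanism: it cannot tell which core was wrong, which is why the answer goes on to describe several strings feeding a voter.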


Seems risky. I remember the automated train control system for the Vienna Hauptbahnhof (main train station) had an x86 and a SPARC, one programmed in a procedural language and one in a production language. The idea was to make it hard to have the same bug in both systems (which could lead to a false positive in the voting mechanism).
This is a great technique to avoid common-mode failures.
Do you have data to back that claim up? I remember reading evidence to the contrary, namely that programmers working on the same problem -- even in different environments -- tend to produce roughly the same set of bugs.

The conclusion of that study was that parallel development mainly accomplishes a false sense of security, and most of the additional reliability in those projects came from other sound engineering techniques. But I have lost the reference, so I don't know how much credibility to lend my memory.
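
To make the concern concrete, here is a toy calculation. It is not taken from the study; the model and the parameters `p` and `q` are assumptions for illustration only. It compares a 2-of-3 voter when versions fail independently against one where a shared bug occasionally makes all versions fail together.

```python
def tmr_failure_independent(p: float) -> float:
    """P(at least 2 of 3 versions fail on an input), assuming independent failures."""
    return 3 * p**2 * (1 - p) + p**3


def tmr_failure_with_common_mode(p: float, q: float) -> float:
    """Toy model: with probability q a shared bug fails all versions at once;
    otherwise each version fails independently with probability p."""
    return q + (1 - q) * tmr_failure_independent(p)


if __name__ == "__main__":
    p = 1e-3  # illustrative per-version failure probability (assumed)
    q = 1e-4  # illustrative common-mode failure probability (assumed)
    print(tmr_failure_independent(p))
    print(tmr_failure_with_common_mode(p, q))
```

With these made-up numbers the independent case comes out at about 3e-6, while the common-mode case is about 1e-4: the voter's failure rate is dominated by `q`, however small `p` is.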

After some searchengineering I found Knight and Leveson (1986), “An Experimental Evaluation of the Assumption of Independence in Multi-Version Programming”, which my memory tells me is the classic paper on common failure modes in reliability via N-version software, and which I was taught about in my undergrad degree: http://sunnyday.mit.edu/papers.html#ft

Leveson also wrote the report on Therac 25.

That was the reason for the different programming paradigms (Algol-like vs Prolog-like), to reduce the probability.
Isn't this exactly what aeroplanes do? Two or more control systems made in different hardware, etc?
I'm not saying people aren't doing it! I'm just not sure it has the intended effect.

(Also to protect against physical failures it works, because physical failures are more independent than software ones, as far as I understand.)

That sounds way too low. Modern fly-by-wire planes are said to have 12-way voting.
> Modern fly-by-wire planes are said to have 12-way voting

Do you have a source for that? Everything I've ever read about Airbus says the various flight control systems are doubly redundant (three units). Twelve sounds like it would be far beyond diminishing returns...

That was word of mouth. This website says 5 independent computers, of which 2 use different hardware and software so as not to fail in the same fashion.

https://www.rightattitudes.com/2020/04/06/airbus-flight-cont...

I'd imagine every computer relies on redundant stick/pedal encoders, which is how a 12-way notion appeared.

There are several subsystems that have backup functionality or piloting fallback available in case of subsystem failure, and subsystems have internal 2-way or 3-way redundancy/voting. See e.g. https://aviation.stackexchange.com/questions/15234/how-does-...
That blog isn't very authoritative, and doesn't go into any detail at all.

> I'd imagine every computer relies on redundant stick/pedal encoders, which is how a 12-way notion appeared.

That's disingenuous at best. The lug nuts on my car aren't 20x redundant... if you randomly loosen four, catastrophic failure is possible.

This shallow dismissal sounds "sus". It's just off.
If you read the link it’s actually two cpu cores on a single cpu die each returning a string. Then 3 of those cpus send the resulting string to the microprocessors which then weigh those together to choose what to do. So it’s 6 times redundant in actuality.
That’s not 6x though.

It’s a more solid 3x, or 3x+3y. A power failure at one chip doesn’t take 6x down to 5x. It takes it down to 4x, with the two remaining physical units, because two logical cores went down with one error.

The x being physical units, and the y being CPUs in lockstep so that the software is confirmed to not bug out somewhere.

It’s 6x for the calculated code portion only, but 3x for CPU and 1-3x for power or solder or circuit board.

I know it’s pretty pedantic, but I would rate it by its weakest layer, which is likely 2-3x.

I don't understand this. If two or more computers fail in the same way simultaneously, isn't it much more likely that there is a systemic design problem/bug rather than some random error? But if there is a design problem, how does having more systems voting help?
It is possible for a random error to affect two computers simultaneously. If they are made on the same assembly line, they may fail in exactly the same way, especially if they share the same wires.

That's the reason I sometimes see the recommendation, for RAID systems, to avoid buying identical disks all at the same time: since they will be used in the same way in the same environment, there is a good chance they will fail at the same time, defeating the point of a redundant system.

Also, to guard against bugs and design problems, critical software is sometimes developed twice or maybe more by separate teams using different methods. So you may have several combinations of software and hardware. You may also have redundant boards in the same box, and also redundant boxes.

They are not going to fail the same way simultaneously. This is protecting against cosmic ray induced signal errors within the logic elements, not logic errors due to bad software.
The multi processor voting approach seeks to solve issues introduced by bit flips caused by radiation, not programming issues.
Having at least 3 computers allows you the option to disable a malfunctioning computer while still giving you redundancy for random bit flips or other environmental issues.
Which is why different sets of computers will run software developed by independent groups on different principles, so that they are very unlikely to fail simultaneously.
It's more complicated than that, in the link, they described it better:

>> The microcontrollers, running on PowerPC processors, received three commands from the three flight strings. They act as a judge to choose the correct course of actions. If all three strings are in agreement the microcontroller executes the command, but if 1 of the 3 is bad, it will go with the strings that have previously been correct.

This is a variation of Byzantine Tolerant Concensus, with a tie-breaker to guarantee progress in case of an absent voter.
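
A rough sketch of the judging rule in that quote, under stated assumptions. The quote does not say how "previously been correct" is tracked, so the per-string agreement counter `track_record` is an assumption, as are the function name `judge` and the string names. A silent (bad) string is modelled as `None`.

```python
from collections import Counter
from typing import Dict, Optional


def judge(commands: Dict[str, Optional[str]],
          track_record: Dict[str, int]) -> Optional[str]:
    """Pick one command from the flight strings.

    commands: string name -> command, or None if that string stayed silent.
    track_record: string name -> number of past rounds in which the string
    agreed with the chosen command (assumed bookkeeping, updated in place).
    """
    votes = Counter(c for c in commands.values() if c is not None)
    if not votes:
        return None  # every string is silent: nothing to execute

    best = max(votes.values())
    candidates = [cmd for cmd, n in votes.items() if n == best]

    def support(cmd: str) -> int:
        # Combined track record of the strings backing this command.
        return sum(track_record.get(s, 0)
                   for s, c in commands.items() if c == cmd)

    # Majority wins; on a tie, prefer the strings that were right before.
    chosen = candidates[0] if len(candidates) == 1 else max(candidates, key=support)

    for s, c in commands.items():
        if c == chosen:
            track_record[s] = track_record.get(s, 0) + 1
    return chosen
```

For example, after two rounds in which strings A and B agreed and C dissented once, a round where A is silent and B and C disagree is resolved in favour of B, because B has the better record.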

> Byzantine Tolerant Concensus

I was taken to task for mis-spelling "consensus"; I used to spell it with two 'c's and two 's's, like you. It was explained to me that it's from the same root as "consent", and that's how I remember the right spelling now.

Good point.
I’m curious how often the strings are not in agreement. Is this a very rare occurrence or does it happen often?