|
This reminds me of a meetup I attended last fall where they were talking about the Spectre/Meltdown issues. I asked the presenters if anything in chip manufacturing/verification processes had changed as a result of that, and they seemed surprised. To me, when a software bug shows up in a critical system, that means you actually have a logistics bug. Airplane control software should not be allowed to have bugs. CPUs should not be allowed to have bugs. And OSes should not be allowed to crash (looking at you, Microsoft). When one of these things happens, in my opinion the correct response is _not_ to just release fixes and workarounds and then say "we'll try really hard to not let it happen again." You do that, sure. But the first time you see airplane software malfunction, that means you need to change the way the software is written and released so that the whole class of issues will never happen again. You don't stop at a public apology, and you don't fire the person who unintentionally wrote the bug. If you have to hire mathematicians to formally prove the critical paths of the software, you do that. If it costs 10x more to release bug-free software, oh well, you do that. All of these corporate people thinking they can save money by spending less on quality are extremely naive. You can do a financial analysis of this, but they're doing it wrong. Did you ever consider what the cost of a whole generation just not trusting air travel at all would be? |
This is pretty good intuition, but a systemic change is often not economically feasible. For avionics software at least, a rewrite would likely have to be recertified from scratch before it would be allowed to fly.
We do, however, have several different quality assurance programs in aerospace that are supposed to address this sort of thing.
Once you identify the root cause, the process found to be deficient is supposed to have a process owner who is required to create a corrective and preventive action plan so the problem does not recur, with more severe problems requiring more robust action plans. Done right, the process owner is empowered to make the changes that need to be made.
These systems tend to be evolutions of ISO 9000 and of the quality practices pioneered by Toyota (IIRC). They are highly bureaucratic and soul-sucking, but they are also the least-shitty solution that's been tried.