by jonp888 1011 days ago
> How can a primary AND its backup system fail safely??? Who specified this?

All safety-critical systems are specified to halt rather than enter undefined behaviour when they encounter something that cannot be processed. An unsafe failure would be entering undefined behaviour. What would you have specified differently that would be safer?

A backup is primarily there in case of hardware failures or for maintenance. If it behaves differently to the primary then something is wrong. Can you explain how and why you would expect a backup system running identical software to behave differently?
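
To make the point above concrete, here is a minimal sketch, not the actual ATC software: the validation rule, the function names and the `REPLICAS` list are all hypothetical, invented for illustration. It shows a fail-stop processor that refuses input it cannot handle, and a backup running the identical code, which necessarily refuses the same input.

```python
class UnprocessableInput(Exception):
    """Raised when a replica meets input it has no defined handling for."""


def process_flight_plan(plan):
    # Hypothetical validation rule, invented for this sketch.
    waypoints = plan.get("waypoints", [])
    if len(waypoints) < 2:
        # Fail-stop: refuse to guess instead of emitting a possibly wrong route.
        raise UnprocessableInput(f"cannot process plan {plan.get('id')!r}")
    return " -> ".join(waypoints)


def run_with_failover(plan, replicas):
    """Try each replica in order; report a halt if every one rejects the input."""
    tried = []
    for name, replica in replicas:
        tried.append(name)
        try:
            return "ok", replica(plan), tried
        except UnprocessableInput:
            continue  # hand over to the next replica
    return "halted", None, tried


# Primary and backup run identical software, so they are the same function here.
REPLICAS = [("primary", process_flight_plan), ("backup", process_flight_plan)]
```

With a well-formed plan only the primary is consulted. With a plan that trips the check, the backup is tried too and halts for the same reason, because it is deterministic and identical to the primary.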


I worked in safety-critical ATC projects in engineering and management positions (systems, quality and compliance engineering) for a decade. ATC systems are supposed to not fail, even under adverse conditions. Where high availability is required for safety reasons, redundant architectures are one of the options. Apparently the "backup system" was conceived for this purpose. According to the report (page 17) the responsible subsystem suffered from a "critical exception [..] that triggers the conditions that led to the incident", which caused both the primary and the backup system to fail, and which has now apparently been fixed. So obviously the system was not supposed to fail on receiving wrong or suspicious flight plan data, and it was apparently pure luck that no such data arrived for five years. To claim that the subsystem (consisting of the primary and backup system) "safely failed" indicates significant gaps in safety management (either faulty safety analyses, faulty specifications, or faulty configuration or software). The report suggests that critical omissions occurred at several levels.

For me it's important to consider the 'ATC system' as a whole. The system as a whole did not fail - no planes crashed, flights still flew - but it was in a degraded state with lower than usual throughput. One component of the system did fail (the FPRSA subsystem) and it seems reasonable to me that given layers of the system lean towards unavailability rather than trying to continue to operate in unforeseen circumstances.

The purpose of a backup system is not to prevent failure - it's to improve resiliency of the system as a whole across a set of foreseen and unforeseen faults. Backup systems failing to handle any specific fault is an expected and predicted behavior. Thankfully in this case there was a backup system that prevented a complete shutdown (and, thankfully, any accident) - the manual processing of flight plans.

Missing the availability requirements is a failure.

Safety is not only about human lives, but also about health and property (also e.g. critical financial and other losses, or reputational damage). The present incident has obviously caused considerable damage. We can only hope that the rest of the system does not suffer from similar omissions and that it is not pure coincidence that even worse events have not occurred.

Yeah of course, but success/failure is also not binary. There are degrees of failure, including low-consequence availability issues, high-consequence availability issues, loss of operational safety, 'never events' (e.g. significant loss of life). In this case the system suffered the second of those options. It seems reasonable that design choices may prioritise that type of failure over the later ones in the list.
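
As a rough sketch of that prioritisation: the severity ordering below follows the list in the paragraph above, while the two design options and the worst-case outcome assigned to each are assumptions made up for illustration, not anything taken from the report.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Ordered from least to most severe, following the list in the preceding comment.
    LOW_CONSEQUENCE_AVAILABILITY = 1
    HIGH_CONSEQUENCE_AVAILABILITY = 2
    LOSS_OF_OPERATIONAL_SAFETY = 3
    NEVER_EVENT = 4


def preferred_failure_mode(options):
    """Return the design option whose assumed worst-case outcome is least severe."""
    return min(options, key=lambda name: options[name])


# Hypothetical design options with assumed worst-case outcomes.
OPTIONS = {
    "halt and fall back to manual processing": Severity.HIGH_CONSEQUENCE_AVAILABILITY,
    "keep running on data that cannot be interpreted": Severity.LOSS_OF_OPERATIONAL_SAFETY,
}
```

Under these assumed worst cases, `preferred_failure_mode(OPTIONS)` picks halting, i.e. the design accepts a serious availability hit to avoid the more severe categories.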

The first part of this argument is semantics - how do we define failure. The second part is IMHO more important - what decisions are taken with regard to the behavior of subsystems and how they influence overall system degradation. In this case the overall design prevented any loss of operational safety which, to me, is a success.

One can talk things up or fix them. As the report (and some comments) suggests, the former is given high priority.