by Rochus 1017 days ago
And they shut down the entire system because of one incoherent plan? What a great example of an ingenious high-availability architecture.

> rather than reject the flight plan for an aircraft that may already be in the air

I wonder why they can't reject the flight plan for an aircraft that's already in the air? Presumably they have any number of reasons to reject a perfectly valid flight plan that's been submitted, let alone invalid ones, so there must be a rejection mechanism?

The explanation offered by the report (from a quick skim, I found this on page 9) is:

> Having found an entry and exit point, with the latter being the duplicate and therefore geographically incorrect, the software could not extract a valid UK portion of flight plan between these two points.

> ...

> In this case the software within the FPRSA-R subsystem was unable to establish a reasonable course of action that would preserve safety and so raised a critical exception

The failure is portrayed as a reasonable thing to do, and yes it's good the system failed safe rather than continued with a bunch of corrupt data no one knew about. But it seems bizarre that a single dodgy flight plan forcing the whole system to shut down was an intentional part of the system design. It sounds like they don't have strong isolation around individual flight plan processing, so an exception thrown there just propagated up and brought the whole thing down.
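To make the isolation point concrete, here's a rough sketch of what I mean: each plan is processed inside its own error boundary, and a failure sends that one plan to a manual-review queue instead of taking down the loop. Everything here is hypothetical (the `extract_uk_portion` stand-in, the plan dictionaries, the "repeated name means ambiguous" rule); it is not how FPRSA-R works, just an illustration of the pattern.

```python
from dataclasses import dataclass, field


class PlanProcessingError(Exception):
    """Raised when a single flight plan cannot be processed safely."""


@dataclass
class Result:
    accepted: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)


def extract_uk_portion(plan):
    # Hypothetical stand-in for the real extraction logic: this toy version
    # treats any repeated waypoint name in the route as ambiguous.
    route = plan["route"]
    if len(set(route)) != len(route):
        raise PlanProcessingError(f"ambiguous waypoint in plan {plan['id']}")
    return route


def process_all(plans):
    result = Result()
    for plan in plans:
        try:
            extract_uk_portion(plan)
            result.accepted.append(plan["id"])
        except PlanProcessingError as exc:
            # The plan is not silently dropped: it is set aside for a human
            # to handle, and the remaining plans keep flowing.
            result.quarantined.append((plan["id"], str(exc)))
    return result
```

Whether a quarantine like this is acceptable from a safety standpoint is a separate question, since someone still has to deal with the plan that was set aside.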

More damningly, duplicated waypoint names with different positions are a known issue, with work ongoing to produce a globally unique set of names (from what the report says), so this is hardly unexpected. Surely any decent test plan would have included this scenario?
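As a sketch of how cheap it is to generate that test scenario: given a table of waypoints, you can list every name that maps to more than one position and feed routes containing those names into the test suite. The waypoint names and coordinates below are made up for illustration, and the `(name, lat, lon)` tuple layout is my assumption, not anything from the report.

```python
from collections import defaultdict


def find_ambiguous_names(waypoints):
    """Return waypoint names that map to more than one distinct position."""
    positions = defaultdict(set)
    for name, lat, lon in waypoints:
        positions[name].add((lat, lon))
    # Keep only names with at least two different positions.
    return {name: sorted(p) for name, p in positions.items() if len(p) > 1}


# Made-up data: "ALPHA" appears at two different positions,
# "BRAVO" is listed twice at the same position.
waypoints = [
    ("ALPHA", 50.0, 1.0),
    ("BRAVO", 51.0, 2.0),
    ("ALPHA", 10.0, -60.0),
    ("BRAVO", 51.0, 2.0),
]
ambiguous = find_ambiguous_names(waypoints)
```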

> I wonder why they can't reject the flight plan for an aircraft that's already in the air?

You need to know everything that may be in the air - if you skip the details of a flight that may be in the air, you risk routing another flight through the same space, with the possibility of a collision. So if you can't process the plan safely, the only option is to shut down; existing flights can continue, but no new flights can be routed until the anomaly is resolved.

> and yes it's good the system failed safe rather than continued with a bunch of corrupt data

The authors of the report obviously made an effort to suggest this, but then on page 18 they nevertheless concede the need for "A permanent software change by the manufacturer within the FPRSA-R sub-system which will prevent the critical exception from recurring for any flight plan that triggers the conditions that led to the incident."