by KuiN 1020 days ago
Here's the full preliminary incident report if anyone wants to read it:

https://publicapps.caa.co.uk/docs/33/NERL%20Major%20Incident...

Obviously concerning that a single (perfectly valid) flight plan can take down both the primary and the backup. Reject the flight plan that the system can't understand; you've then got 4 hours for someone on front-line support to work out the correct path and enter it manually. Meanwhile, it'd be good if the system continued to operate.
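To make the "reject it and carry on" idea concrete, here's a minimal sketch. It is not how the NATS system works; `PlanProcessor`, `parse`, `submit`, and the plan fields are all hypothetical names made up for illustration. A plan that can't be processed is set aside for manual entry and the rest keep flowing:

```python
from dataclasses import dataclass, field


@dataclass
class PlanProcessor:
    # Hypothetical stand-in for a flight plan processor; every name here is illustrative.
    accepted: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)

    def parse(self, plan: dict) -> dict:
        # Placeholder check: a real system would resolve the full route here.
        if not plan.get("route"):
            raise ValueError("cannot resolve route")
        return plan

    def submit(self, plan: dict) -> bool:
        try:
            self.accepted.append(self.parse(plan))
            return True
        except Exception as exc:
            # Set the plan aside for manual entry instead of halting everything.
            self.quarantined.append((plan.get("id"), str(exc)))
            return False


proc = PlanProcessor()
results = [
    proc.submit(p)
    for p in [
        {"id": "A1", "route": ["X", "Y"]},
        {"id": "B2", "route": []},  # the plan the system can't understand
        {"id": "C3", "route": ["Z"]},
    ]
]
```

Whether skipping a plan is acceptable in a safety-critical system is a separate question; the sketch only shows the control flow.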

Further concerns about first- and second-line support being unable to find in the logs the cause, or even the flight plan being processed, when the systems failed. They had to bring in the 3rd party developers to look at "lower-level" logs to find out what happened. If your monitoring/logging isn't good enough for the first responder to work out at least what the system was doing when it failed, that's a significant problem.
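A rough sketch of what I mean by logging, using Python's standard `logging` module. The logger name `fpl_demo`, the `process` function, and the plan fields are assumptions for illustration, not anything from the report. The point is only that the input's identity is written to the ordinary log before work starts and again on failure:

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted log lines in memory so the example is self-contained."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


handler = ListHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
log = logging.getLogger("fpl_demo")  # hypothetical logger name
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False


def process(plan):
    # Record which input is in flight before touching it.
    log.info("processing plan id=%s", plan["id"])
    try:
        if not plan["route"]:
            raise ValueError("cannot resolve route")
    except Exception:
        # log.exception appends the traceback to the message.
        log.exception("failed while processing plan id=%s", plan["id"])
        raise


try:
    process({"id": "B2", "route": []})
except ValueError:
    pass
```

With something like this, the first responder can at least see which plan was in flight without needing the "lower-level" logs.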

2 comments

Most of the system did continue to operate - but it couldn't accept new flight plans automatically. The flight plans were submitted 4 hours in advance, so the restrictions only went in after a couple of hours. Still, I agree it would have been better if it had continued - but yes, the most important bit is how long it took to find the bad plan.
Yep. From the report: At 0832 both systems failed and the controllers started to empty the four-hour buffer. At 1100, systems still weren't back, and so to avoid the hard cutover that looked to be coming at 1230, they began the switch to manual mode. It took them until 1336 to restore the automatic systems, and until 1803 to fully switch out of manual mode.

Given those operational constraints, it sounds like the support teams basically have two hours to resolve a critical system failure.
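For anyone who wants the arithmetic behind that, a quick check using only the times quoted above (the variable names are mine):

```python
from datetime import datetime


def t(hhmm: str) -> datetime:
    # Parse a same-day 24-hour time such as "0832".
    return datetime.strptime(hhmm, "%H%M")


failure = t("0832")   # both systems fail
manual = t("1100")    # switch to manual mode begins
cutover = t("1230")   # four-hour buffer would run out
restored = t("1336")  # automatic systems restored
normal = t("1803")    # fully out of manual mode

to_manual = manual - failure     # 2 h 28 min before the manual switch began
to_cutover = cutover - failure   # 3 h 58 min of buffer in total
to_restore = restored - failure  # 5 h 04 min until automatic systems were back
to_normal = normal - failure     # 9 h 31 min until fully out of manual mode
```

So the window before the manual switch started was about two and a half hours, out of a buffer of just under four.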