Hacker News
by tjr 1339 days ago
One thing to consider when looking at such things is that commercial avionics software systems are full of known limitations. I do not know if this particular 51-day limitation was intentional or not, but in general:

Avionics software starts with writing comprehensive requirements. When the software itself is developed based on those requirements, it is then tested against the requirements, always in a real functioning airplane, but also often in smaller airplane-cockpit-like rigs and in purely simulated environments.

Nobody is going to write a requirement that says "this avionics subsystem will function without error forever". Even if you thought you could make it happen, you can't test it. So there are going to be boundaries. You might say that the subsystem will function for X days. What happens after that? It may well run just fine for X+1 days, or 2X days, or 100X days. But it's only required to run for X days, and it's only tested and certified for running for X days.

I could easily imagine that this particular subsystem was required and certified for some value of X <=51 days, and it just so happened that if the subsystem ran for over 51 days then it started to fail. Or, it could have been a genuine mistake.

But even if the intended X wasn't 51 days, there almost certainly was some intended, finite value for X. We might say, "well, my laptop has run for three years without needing a reboot". Great! Is that a guaranteed, repeatable state of operation that the FAA would certify? Probably not. And besides that, do we really want to have to endure a three-year verification test?

In most software, we are happy to say, "it should run indefinitely". For avionics software, that's insufficient. We instead say "it will run at least for some specific predetermined finite amount of time" and then back up that statement with certifiable evidence.
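To make the "finite X" idea concrete, here is a minimal sketch of a requirement stated as a bounded, checkable quantity rather than "runs forever". The class name, the method names, and the 51-day value are illustrative assumptions only; nothing here reflects an actual avionics requirement or certified limit.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class UptimeRequirement:
    """Hypothetical requirement: certified for at most `certified` of continuous power."""
    certified: timedelta

    def within_certified_envelope(self, uptime: timedelta) -> bool:
        # Inside the envelope, behaviour is backed by test evidence.
        # Outside it, the system may still work, but nothing guarantees it.
        return uptime <= self.certified

    def time_until_reset_due(self, uptime: timedelta) -> timedelta:
        # Remaining certified time, never negative.
        return max(self.certified - uptime, timedelta(0))


# Assumed value of X, purely for illustration.
req = UptimeRequirement(certified=timedelta(days=51))
print(req.within_certified_envelope(timedelta(days=50)))
print(req.time_until_reset_due(timedelta(days=50)))
```

The point of the sketch is only that a bounded requirement can be verified in bounded time, whereas "indefinitely" cannot.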

5 comments

I work in a field that operates under similar development constraints (namely, it's a mature product in a mature field with well-defined requirements). Because of this, I regularly get calls from customers wondering why their system can't do X or Y in the B way instead of the A way, and I have a similar conversation, in which I have to explain: "No, that wasn't part of your requirements 5 years ago. If you want to change it, you'll need to pay us for more development." That normally eliminates the requirement for whatever it was they wanted pretty quickly.

Also, uptime is a factor. I've seen what Windows looks like when it runs out of GDI objects, and it's strange. But once you've seen it, you can explain to the customer the importance of regular reboots/restarts.

I never understood why regular, scheduled reboots are considered to be a problem to begin with.
It can expose hidden costs. A PC that can only be assured to be correct by rebooting cannot continuously monitor a flow process that cannot be interrupted for the reboot window. Either the system has to be designed to work with two PCs, or some kind of data buffering has to be designed in, or the specification has to be changed to redefine "continuous"(*).

Which, by the way, is what should be done, but... it can cause rage.

[*] may not be continuous or complete in all circumstances
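As a rough sketch of the "designed to work with two" option: if each monitor needs a periodic reboot, staggering the reboot windows keeps at least one monitor up at all times. The function names, the one-minute resolution, and the 24-hour period with a 10-minute reboot are all assumptions made up for this example.

```python
def is_up(t_min: int, offset_min: int, period_min: int, reboot_min: int) -> bool:
    """True if a monitor whose reboot window starts at `offset_min` is running at minute `t_min`."""
    phase = (t_min - offset_min) % period_min
    # The monitor is down for the first `reboot_min` minutes of each period.
    return phase >= reboot_min


def always_covered(offsets: list[int], period_min: int, reboot_min: int) -> bool:
    """True if, at every minute, at least one monitor is up.

    The schedule repeats every `period_min`, so checking one period is enough.
    """
    return all(
        any(is_up(t, off, period_min, reboot_min) for off in offsets)
        for t in range(period_min)
    )


DAY = 24 * 60
print(always_covered([0], DAY, 10))         # single monitor
print(always_covered([0, DAY // 2], DAY, 10))  # two monitors, staggered by 12 hours
```

A single monitor always leaves a gap during its reboot window. Two monitors close the gap only if their windows do not overlap, which is exactly the kind of hidden design cost mentioned above.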

But a 787 works fine with those reboots being part of scheduled maintenance. So the issue is what exactly again?
Essentially, changes in typical operating procedures at airlines broke previous assumptions about regular full aircraft power-downs, which triggered both the 248-day bug and the current 51-day bug.

It used to be that an aircraft would get a full power down as often as daily, but as individual components got more reliable and external power easily available, it became common for aircraft to not be shut down fully between flight days.

Nothing. I was responding to a question asking why it might be a problem. The question wasn't "in a 787", it was "in general", so I suggested a class of problem in which it might surface. The wider question.

All aircraft have maintenance schedules. A requirement to reboot a computer periodically isn't onerous, but the insane costs of recertification are. Fixing this problem so that no reboot is required would be very expensive: not just the FAA process burden but the wider costs. The 787 battery problems probably wrecked the entire profit of the model for years.

The Max flight safety issue on another Boeing aircraft may mean it's never profitable. The industry is weird.

I worked in healthcare where our EMR went into downtime for two hours on daylight transition days. It was extremely disruptive as we had to switch to a paper process for that time period that needed to get reconciled with the EMR at the end of the shift.

Unless you have a dedicated team doing that, preventative reboots and various “workarounds” sound great on paper for administrators but make for a shitty experience for people doing the actual work.

The difference between your example and the reboot requirements of various aircraft: aircraft reboots happen in controlled environments, on the ground, when the aircraft is out of operation, and are done by dedicated, trained and certified maintenance staff. Those reboots, while funny at first glance, do not interfere at all with aircraft operations.
Sounds about right. But it’s still a critical failure for a fault of any kind to ever display incorrect information to the pilot.
And in this case it seems one function of the software is interfering with another, which causes the incorrect display.

> I do not know if this particular 51-day limitation was intentional or not
I highly doubt it was intentional. Boeing's already had to issue an AD for similar behavior on the 787:

https://www.engadget.com/2015-05-01-boeing-787-dreamliner-so...

If they knew about it there'd be no need for an AD. Boeing tried to become the aviation equivalent of a fabless chip designer with the 787 and it didn't go well at all. Turns out they had little-to-no experience managing external development and manufacturing teams. I don't know anything about the 51-day bug, but the 248-day bug caused critical failures that you really wouldn't want happening in flight.
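For a sense of where a figure like 248 days can come from: one commonly cited explanation, which is an assumption here and is not confirmed anywhere in this thread, is a signed 32-bit counter incremented every 10 ms. The arithmetic below shows that such a counter would overflow after roughly 248.55 days. The tick rate and counter width are the assumed parts.

```python
INT32_MAX = 2**31 - 1
TICKS_PER_SECOND = 100      # assumption: one tick every 10 ms
SECONDS_PER_DAY = 86_400


def days_until_overflow(max_count: int, ticks_per_second: int) -> float:
    """Days of continuous running before a counter starting at 0 exceeds max_count."""
    return (max_count + 1) / ticks_per_second / SECONDS_PER_DAY


def wrap_int32(value: int) -> int:
    """Simulate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


print(round(days_until_overflow(INT32_MAX, TICKS_PER_SECOND), 2))
# One tick past the maximum, the counter jumps to a large negative value.
print(wrap_int32(INT32_MAX + 1))
```

Any code that assumes elapsed time only ever increases can misbehave at that wraparound, which is why a periodic power cycle works as a mitigation: it resets the counter to zero.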

> Nobody is going to write a requirement that says "this avionics subsystem will function without error forever".

These time limits could at least be pegged to the real-life intervals at which the system is going to be shut down anyway. If the system continues to be operated past that point, skipped maintenance intervals could be identified as the cause.

It is in fact possible to write provably correct software for safety-critical applications.

Not by testing, but by using formal methods.

That's nice for the software. Now how about the hardware? How about the electronic hardware's not-exposed firmware, does that count? Did the subcontractor test it for three years at 10,000 feet for radiation-induced bit-flips? With or without lightning strikes?
> How about the electronic hardware's not-exposed firmware, does that count? Did the subcontractor test it for three years at 10,000 feet for radiation-induced bit-flips? With or without lightning strikes?

Blast the module in a radiation chamber. It can be done, it's only extremely expensive - the military has the budget (makes sense, given that a fighter jet or a bomber should be able to power through a nuclear bomb fallout), but civilian airliners are all about cost efficiency.

Including a system reboot, on the ground, as part of your ongoing maintenance activities is a fault, or incorrect software.
Is a roof you have to redo every 20 years, or a paint that only lasts 10 years faulty? Is a car that needs brakes replaced every X thousand kilometers faulty?

It is only faulty if it does not run according to spec, or if you run it outside the spec.

Exactly. If the manual says "reboot every 51 hours", you do just that and all is fine. If you have to reboot every, say, 25 hours, something is broken.