Hacker News
by sharemywin 3028 days ago
I remember hearing about this in my numerical analysis class.

1. I remember hearing the system was only designed for XX operational hours but was being run over the operational spec.

2. The time was stored in base 10, so the calculation errors added up over time, or something like that; if they had used some base 2 timing scheme it wouldn't have had issues with rounding errors.

My class was in the mid-nineties, so the details of my 25-year-old memory are pretty hazy... at best.


My recollection matches with yours, except I learned about it in the first week of Embedded Systems 101. If it isn't a standard part of the curriculum at every college embedded systems class, it should be! It really drove home the point that bad code can kill.
I learned about it in a Decision Analysis course and had a completely different point driven home. This wasn't bad code. It was code that was correctly written to a very well defined requirement ("System shall be operational for at most X hours before a reboot"). The code was written to a spec that was approved by the customer (the military). Unfortunately though, that requirement wasn't communicated to the end users.
From the article

"However, the timestamps of the two radar pulses being compared were converted to floating point differently: one correctly, the other introducing an error proportionate to the operation time so far"

The code had a defect that affects its aim from the moment it is turned on, but because it took 100 hours to drift by 1/3 of a second, the problem wasn't apparent when the system was rebooted regularly. If software can't continue to do basic math without manual intervention, it's defective.

In fact everyone including the company that made it admits it's defective.

It's possible your teacher picked a great example to illustrate a communication failure.
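For anyone who wants to check the "100 hours to drift by 1/3 of a second" figure, here is a minimal sketch of the arithmetic. It assumes the commonly cited reconstruction of the bug, which is not spelled out in this thread: the clock counted tenths of a second, and the constant 0.1 was chopped to 23 binary fractional bits in a 24-bit register. The names `stored_tenth` and `clock_drift_seconds` are made up for illustration.

```python
from fractions import Fraction

# Assumption (not stated in this thread): 0.1 was stored chopped to
# 23 binary fractional bits, as in the commonly cited reconstruction.
FRACTION_BITS = 23
TICKS_PER_HOUR = 3600 * 10  # the clock counts tenths of a second


def stored_tenth(bits=FRACTION_BITS):
    """Return 0.1 chopped (not rounded) to `bits` binary fractional digits."""
    scaled = Fraction(1, 10) * 2**bits
    return Fraction(int(scaled), 2**bits)  # int() truncates toward zero


def clock_drift_seconds(hours, bits=FRACTION_BITS):
    """Accumulated error after `hours` of uptime, using exact rational math."""
    per_tick_error = Fraction(1, 10) - stored_tenth(bits)
    return float(hours * TICKS_PER_HOUR * per_tick_error)


if __name__ == "__main__":
    for hours in (1, 8, 20, 100):
        print(f"{hours:>4} h -> {clock_drift_seconds(hours):.4f} s")
```

Under that assumption the drift at 100 hours comes out to roughly 0.34 s, which lines up with the 1/3 second figure, while a few hours of uptime gives an error on the order of hundredths of a second.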

The Patriot system was originally designed to operate in Europe against Soviet medium- to high-altitude aircraft and cruise missiles traveling at speeds up to about MACH 2 (1500 mph). To avoid detection it was designed to be mobile and operate for only a few hours at one location.

http://archive.gao.gov/t2pbat6/145960.pdf

Page 2

I dug into reference 48 from Wikipedia, which cited this report, and then found it with a Google search.

The fact that the bug manifests after a longer than normal period of operation doesn't ex post facto make it not a bug. If you add 2 and 2 and get 42 you failed.

It is, however, a good explanation of why it remained undetected.

Conversations like this are surprisingly common in our industry ;-) To help ease communication there are 2 terms in common usage: software error and bug. A software error is code that is incorrect. A bug is a software error that manifests a user visible problem. In this case the incorrect code is a software error, but it does not manifest a user visible problem unless it is used outside some assumed parameters. The bug doesn't exist when the product is used as intended. One can argue that the behaviour is undefined when used outside of the intended use and therefore there is no bug. There is no arguing about the software error, though. It exists.

Arguing about whether or not something is a bug is pointless precisely because someone will just pull the "behaviour outside of expected use is undefined" thing out of the bag. Regardless of whether or not you should have expected something to work, if your product unintentionally kills people due to a software error, you have a gigantic problem. It's really that lesson we have to keep in mind.

I get this all the time from project managers: it doesn't matter if X fails because we aren't designing the software for X. But you can't just dismiss X -- you need to understand the consequences of X just in case somebody tries to do it. For example: It corrupts the DB if 2 people edit the same record at the same time. The project manager says, "Not a problem. I got sign off from the groups using the app and they promise never to have 2 people working on the same thing. Problem solved, and no need to modify the code!" Of course a week later the DB is corrupted and it's not a bug (it's a feature ;-) ).

It does make software development more costly, and you need to draw the line somewhere. This requires balancing risk. But I will argue that if you are writing software for a missile, there is no hiding behind the "we didn't design it for that" argument.

If by "defective" you mean has rounding errors, then sure. Everything that rounds numbers is defective. To be fair, rounding errors can sometimes be mitigated by carefully changing the order of operations, but never fully eliminated in those cases.
You can avoid rounding errors 100% of the time for as long as you like. For example you can use integers.

It's entirely possible to have any degree of precision that is reasonably required, up to the limits of what our tools can measure.

This isn't about an inherent limit of computation; it's just programmer error.
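A minimal sketch of the integer approach, assuming a clock that counts whole tenths of a second (the function names are hypothetical, not from any real system). The float version accumulates representation error; the integer version keeps tick counts exact and only formats them as seconds for display.

```python
def elapsed_seconds_float(ticks):
    """Accumulate 0.1 s per tick in binary floating point (inexact)."""
    total = 0.0
    for _ in range(ticks):
        total += 0.1
    return total


def elapsed_ticks(start_tick, end_tick):
    """Interval between two timestamps, kept as an exact integer tick count."""
    return end_tick - start_tick


def ticks_to_seconds_text(ticks):
    """Format a non-negative tick count as seconds without using floats."""
    whole, tenths = divmod(ticks, 10)
    return f"{whole}.{tenths}"


if __name__ == "__main__":
    print(elapsed_seconds_float(10) == 1.0)       # float sum misses 1.0
    print(elapsed_ticks(3_600_000, 3_600_010))    # exact, even after 100 hours
    print(ticks_to_seconds_text(3_600_010))
```

The point is only that the interval between two timestamps stays exact no matter how long the system has been up, as long as the subtraction happens on the integer counts.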

I'm failing to find anything that says the requirement was "System shall be operational for at most X hours before a reboot". It's more likely that there was a key performance parameter (KPP) saying that it should be functional for at least some period of time. And that was what was tested.

Generally KPPs (which aren't requirements themselves, but influence the requirements for systems) are set at lower bounds, not upper bounds, for something like this. You wouldn't set a KPP: Should only work for 4 hours. You'd use: Should work for at least 3 hours, 4 hours desirable (or some similar language). If it works for longer, that's great. But longer won't be tested since it's not a requirement or goal for the system, which also means failure modes for longer runtimes won't be encountered because they're outside the bounds of the system requirements and specs.

As I gather, the Patriot was a mobile anti-aircraft / anti-cruise-missile platform that was meant to move, be activated when needed, and then be turned off and moved again because the original location was expected to become a target. It was pressed on short notice (with some software upgrades, but not the normal cycle of specs, development, and validation that would go into that kind of repurposing) into stationary, continuous-coverage, anti-ballistic-missile use. Critically, that meant dealing with much faster targets than originally envisioned, which means short warning times where deactivations carry a lot more risk.

So, while it's horrible in results, it can be very easy to understand why basic functions would have specs not at all adapted to the use to which it was being put.

There's a distinction to be made, though. There was no requirement that it be rebooted after some period of time, though the original developers expected that this would happen. Consequently it was not evaluated for 20-hour or 100-hour performance. That's a critical distinction in developing, testing, and fielding systems. And the way we term it in our requirements documents reflects this. We rarely say: System SHALL fail after some period. Rather we say: System SHALL perform for some period. We leave the result of longer durations undefined. The system may work, or it may not; we aren't required to test it and so we don't. If the customer wants it to run longer, we can evaluate it but they have to communicate that back to us (or to the testing facilities, which may not be the developers).

Similarly, with regards to the speed of the missiles, the requirement would not be: System SHALL fail to detect missiles above some threshold speed. But rather: System SHALL detect missiles below some threshold speed. This leaves open the possibility that it may be more or less accurate outside that range. It should be documented for the operators as a potential for failure: System may be ineffective against missiles operating above X m/s. But the requirements wouldn't include that detail.

This pushes the problem into the documentation and training. Since it was originally designed as a mobile platform with short run-times, there was no explicit operating procedure requiring reboots. It was just assumed. At the same time, the failure itself (after 20 hours) was unknown because testing hadn't been done to see what would happen.

Getting a slightly wrong answer ought to have been detectable even after a short period of time, even if the difference was microscopic.
Not the way we test these things. You set the KPPs and analyze system performance. Especially back then, there wouldn’t have been much in the way of unit testing or anything for these sorts of systems.

You set your performance parameters (have some success rate while operating continuously for up to 4 hours). Then you launch missiles at it (simulated and real). If you stop enough of them you're good.
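Both sides of this exchange can be shown in a few lines. The sketch below is illustrative only: it assumes the commonly cited 23-fractional-bit truncation of 0.1 (not stated in this thread), and the 0.05 s tolerance is a made-up number, not a real Patriot performance parameter. An exact comparison against a reference flags the error after one hour, while a tolerance-based performance check passes at the tested uptime and only fails far beyond it.

```python
from fractions import Fraction

# Assumption: 0.1 chopped to 23 binary fractional bits (commonly cited
# reconstruction of the bug, not taken from this thread).
STORED_TENTH = Fraction(int(Fraction(1, 10) * 2**23), 2**23)
TICKS_PER_HOUR = 36_000  # tenths of a second per hour

# Hypothetical tolerance for a system-level test, for illustration only.
HYPOTHETICAL_TOLERANCE_S = 0.05


def drift_seconds(hours):
    """Clock error accumulated after `hours` of continuous uptime."""
    return float(hours * TICKS_PER_HOUR * (Fraction(1, 10) - STORED_TENTH))


def exact_check_passes(hours):
    """Unit-style check: converted time must match the exact reference."""
    return drift_seconds(hours) == 0.0


def performance_check_passes(hours, tolerance_s=HYPOTHETICAL_TOLERANCE_S):
    """System-level check: passes while drift stays within the tolerance."""
    return drift_seconds(hours) <= tolerance_s


if __name__ == "__main__":
    print(exact_check_passes(1))          # exact comparison at 1 hour
    print(performance_check_passes(4))    # tolerance check at tested uptime
    print(performance_check_passes(100))  # tolerance check at 100 hours
```

So the error is detectable early in principle, but only by a test that compares against an exact reference; a test that scores overall performance within the specified uptime would not see it.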

Article discussing testing software back in 1976

https://dl.acm.org/citation.cfm?id=807721

No real good excuse for not actually testing systems that can take or save lives.

I'm just regurgitating what I remember from some article the professor brought in.

Wikipedia didn't exist when I was taking the course. It's probably in one of the 100 odd source articles since it wasn't just my professor that pointed it out. One of the other commenters mentioned a similar discussion from one of their professors.

Fair. I wasn't replying to you; your #1 sounds a lot like what I'm saying, though.

  1. I remember hearing the system was only designed for
  XX operational hours but was being run over the
  operational spec.
This is very similar to my "at least", which is very different from "at most". In requirements we wouldn't bound ourselves like that. We wouldn't say our system should run for at most 8 hours. We'd say it should run for at least 8 hours. However, we won't say what happens after 8 hours because we don't test it (it's not a requirement). We may communicate to the operators that the system should be rebooted after some period of time if there's a known or anticipated issue, or we may include a soft boot to reset things. For many of our systems, their operating time is usually under 12 hours (they go on aircraft that don't fly for days at a time, mostly), so we never test anything past about 48 hours anyway. If there's an issue that arises around 96 hours, we'd never know from our testing and would only know about it if an operator pushed it to that limit and recorded the circumstances properly.

I never did embedded programming or government programming, so what you're saying makes sense from a spec perspective.

It's interesting that FM 44-85 "Patriot Battalion and Battery Operations" is publicly available and pretty easy to find. We discussed this in a systems analysis class back in '04 using a copy of FM 44-85 released in '97. In summary, the class blamed TRADOC and the tech writers for publishing a manual that did not accurately reflect real-world use cases, with the software bug being a secondary concern.

I googled up a copy of FM 44-85 to refresh my memory and write this post; it's pretty much as I remember it.

The doctrine in chapter 3, planning, is extreme mobility and rapid hour-to-hour activation and deactivation of individual missile batteries, kinda like infantry bounding overwatch but glacially slower, on an hourly basis. For example, see Table 3-2, where the four batteries are rotating on and off and moving/maintaining on a detailed hour-by-hour basis. So the doctrine seems to be that uptimes should typically be on the order of 3 or 4 hours maybe, not a zillion days in a row as actually deployed when the software bug hit.

The doctrine in chapter 5, operations, goes into a big discussion of defense design strategies. The weapon system is inherently sectorized; this naturally leads to overlapping areas of fire being very important. You have to ask why the unit that had a ridiculous uptime never shut down to perform daily maintenance, which would inherently involve rebooting stuff. It's no big deal to down a system because sectorization and overlap are inherently built into the technology. It's reasonably well understood that technically you can tell an individual infantry soldier to guard a post for 100 hours or 1000 hours continuously, but someone screwed up if they issued an order like that because it's simply impractical. That leadership failure will be discussed later. So... aside from the question of why the software failed under ridiculous conditions, you have to ask WHO more or less knowingly misapplied the resource without backup or planned maintenance intervals? Possibly this section of the FM was rewritten between the tragedy and the release of the copy I have access to, but it's still poorly written. Or what section of the FM would have ever given the officers the idea that the weapons system can be deployed the way they did it? The idea that the weapon system could do what they told it to do came from somewhere, and it apparently was not the documentation.

The doctrine in chapter 6, support, has a little blurb about battalion-level staff officers. What did the EMMO think about keeping a Patriot booted up and running for 100 hours without a maintenance interval? Missile maintenance is literally his only job. And if that slot was unfilled, it's the job of the S4 and the XO to cover or reassign someone or otherwise work around it. Around page 6-16 there's a discussion about operators being responsible for maintenance... I had a humvee assigned to me, I hated it, it leaked oil all the time, but the point is even my junky humvee had daily maintenance tasks for PMCS. The Patriot missile PMCS checklist is probably classified, but if a lowly humvee has daily maint, how can a missile not have a much longer and more complicated daily maint? And this implies someone is pencil-whipping maint (I mean, everyone kinda does that, but..)

It's hard to summarize a class discussion, but from the point of view of a systems analysis class, mostly non-military other than myself, the end users were being innovative and adapting and overcoming, which unfortunately means the doctrine and specifications of the weapon system have little to do with how it's being used. The class considered this the biggest systems analysis mistake of the tragedy. Why even write docs and specs if the users won't read them and they have no relation to what the users want to do? I guess a good HN analogy would be that you could creatively deploy binary executables using the "cat" command and hand-typing unicode, and that would be a nifty hack to work around a problem but would be a pretty stupid way to operate normally. Specifically, the Army's own docs used to train and plan operations imply shorter operational terms interspersed with maintenance intervals and deep redundancy, none of which seems to have anything to do with the failed deployment.

There was a big argument in class that it had nothing to do with systems analysis and was merely a leadership failure, using the example above: technically you can order a soldier to stand guard for a hundred hours, when guard shifts are normally a couple of hours. When he passes out asleep around 48 hours into his shift, you can try to blame the soldier or declare there's a bug in our brain preventing 100-hour deployments, or you can even blame the manual and the technical writers for not putting a warning in the manual not to do dumb things. But fundamentally that's just passing the buck: it was a failure of leadership to assign a unit to a task it's not designed to handle, then cover it up by pretending it's merely a software bug or something. I don't know enough history of the tragedy; it's possible the Army correctly relieved some officers of command and it's only the media and press who blame the software bug.

You can imagine the look on the face of the software developers when they got the bug report; like dude, did you ever read FM 44-85, or if you aren't reading it, what are you reading, so we can read it?

The software guys likely never read the field manuals. I know I never did. I read the specs and requirements documents, which are different things than what operators receive. The program office is responsible for maintaining synchronicity between the two with regards to performance parameters (reqs and specs) and performance expectations (manuals). The test office should’ve been familiar with both sides as well.