Hacker News
by sillysaurus3 3700 days ago
Software shouldn't necessarily try to account for errors in that manner. Usually, the most graceful thing to do is to exit cleanly.

For example, if there is a massive amount of data, it has to be stored on disk. It's too large to keep in memory. And if the point of the program is to transform that data in real time, then it has to have access to the disk.

The antivirus basically unplugged the disk. What can it do to recover? There's nothing to be done.

It should be able to survive that situation, of course. When the disk is plugged back in, it should be able to restart without any problems. But I think that's a different kind of resiliency than what you're referring to.

In this case, the only way to recover would be to copy the frozen data to a new area of the hard drive, assuming it retained read access. But such complexities result in brittle implementations, prone to acquiring bugs. What if the disk space runs out? So you check beforehand whether there's enough space. But what if some other program starts consuming disk space in the middle of your copy operation? And so on. It's an endless spiral of design complexity.

The situation in the article seems closer to hardware failure than a design oversight.

3 comments

> When the disk is plugged back in, it should be able to restart without any problems. But I think that's a different kind of resiliency than what you're referring to.

Yes and no. I was referring to restarting internally when the error condition went away but restarting the app and waiting for telemetry to return can be a valid solution.

Think of your torrent software. If you crank your firewall to block it while it's running, it will not crash. If your disk fills up, it won't crash. When the network comes back or more drive space is freed, it will restart its internal mechanisms. You wouldn't want the whole program to restart in these conditions. If it runs out of memory, however, choosing to exit might be the best recovery mechanism.

I think a life critical medical application can at least strive for internal restart and do an external restart if all else failed. The article stated they had to reboot the machine to get it back. Now that's way worse.

> The situation in the article seems closer to hardware failure than a design oversight.

Hardware failure is almost always a permanent condition. This was a case of "my I/O stopped briefly and would have come back if my code could handle it".

During a surgery, the program doesn't have the luxury of showing a screen that says "No telemetry available." Such a program would be considered equally unreliable. Worse, it would lead to confusion: "Why is the telemetry unavailable? What does 'Error Code 2931' mean?"

A spectacular crash immediately led to pinpointing the problem: The antivirus.

If the program's sole purpose is to transform a massive amount of data in real time, it must have disk access by definition. It can't not have disk access. What would you suggest it do?

Yes it does! Showing "no telemetry available" is exactly what it should do. Crashing = unreliable. Reporting an error condition = reliable.

Immediately? Took them 5 minutes to reboot the computer. The scan of the folder would take seconds let alone minutes. Pinpointing the problem is secondary. Not killing the patient is primary.

> If the program's sole purpose is to transform a massive amount of data in real time, it must have disk access by definition. It can't not have disk access.

And that is the mindset the programmers of the software had. You have to take care of error conditions. The processing can't work without disk access, but loss of disk access can occur temporarily or permanently. What can you do? Pause the processing part of your program. Or make the processing part treat "no data" as valid input and display something else.

Imagine taking that viewpoint with an ECG machine: This machine displays a heart rate waveform. So it must have a heart rate input. If there is no heart rate we'll just crash requiring a 5 minute reboot.

Hell no! Draw a straight line and set off a buzzer!
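
To make the "treat no data as valid input" idea concrete, here is a minimal Python sketch. The names `Frame` and `process` and the doubling transform are made up for illustration; nothing here is taken from the actual device software. The only point is that a missing batch produces an explicit status and an alarm rather than an exception or a fake reading.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Frame:
    """What the display should show for one update cycle."""
    values: Optional[Sequence[float]]  # None means "no telemetry", not a zero reading
    banner: str                        # operator-facing status text
    alarm: bool                        # whether to sound the buzzer


def process(samples: Optional[Sequence[float]]) -> Frame:
    """Transform one batch of samples; a missing batch is a normal input."""
    if samples is None:
        # No data is a state to report, not a reason to crash.
        return Frame(values=None, banner="NO TELEMETRY AVAILABLE", alarm=True)
    # Stand-in for the real transformation (hypothetical: a simple scaling).
    return Frame(values=[s * 2.0 for s in samples], banner="LIVE", alarm=False)
```

The display code then only has to look at `banner` and `alarm`; it never needs to guess why `values` is empty.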

I agree with you, but the flat line might not be the best example because that has a very specific meaning (asystole) that doctors will take certain actions based on without necessarily trying to verify it manually when time is already critical. You should never be able to confuse an error message for anything else.

> You should never be able to confuse an error message for anything else.

Exactly. Which is why "Can't read file sensors.dat" is way better than just crashing. Crashing is one of the worst error messages you can get because you don't know what happened.

" Avernar 18 hours ago

Yes it does! Showing "no telemetry available" is exactly what it should do. Crashing = unreliable. Reporting an error condition = reliable.

Immediately? Took them 5 minutes to reboot the computer. The scan of the folder would take seconds let alone minutes. Pinpointing the problem is secondary. Not killing the patient is primary."

Well put. As far back as the Burroughs B5000, the best way to handle erroneous software or I/O was to freeze it, notify the administrator/user of the problem, and give them sensible options for how to proceed. They might restart the I/O, restart the app, modify erroneous data to proceed (rare here), and so on. Crash and reboot is a Windows 95/NT strategy where incompetence dominated. Today's Windows OS and tooling can do much better with little effort by developers.

But the spectacular crash didn't pinpoint the problem. That came afterwards, when the manufacturer was able to look into the crash.

In a situation like this, confusion is all but inevitable. As a developer, the goal should be to minimize that confusion to the greatest extent possible. A blank screen and crash introduces another step to the process as people wonder "what's going on?" instead of "shit, it threw an error." It's probably not a big deal, but with medical devices during surgery, that extra step could be hugely problematic.

I wonder what surgeons think about software engineers? Except for open heart surgery, they don't normally do their fixes by stopping and starting the thing they are repairing...

I totally disagree.

Sure, usually the most graceful thing to do is exit and hope a human fixes it. But that's usual because the usual condition is that sudden failure is NBD and a human is right there to screw with it.

That's becoming less common, though. When software was mostly something running on a PC doing some boring office task, reliability didn't matter. But as software is running our airplanes, our cars, our medical devices, and even, as with implanted pacemakers and insulin pumps, our bodies, then reliability goes from NBD to BFD.

We see the way forward with things like Chaos Monkey [1] and crash-only software [2] and the sort of design for failure you see in things like Agent supervisor hierarchies [3], where the way to reliability is through designing for failure recovery from the beginning and testing thoroughly to make sure it really happens.

[1] https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey

[2] https://en.wikipedia.org/wiki/Crash-only_software

[3] http://doc.akka.io/docs/akka/snapshot/scala/fault-tolerance....
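
For anyone who hasn't seen the supervisor pattern, a toy Python version follows. It is only a sketch of the general idea (restart a failed worker, up to a budget), not how Akka or any real supervisor library does it; `supervise`, `max_restarts` and `on_crash` are names invented here.

```python
from typing import Callable


def supervise(
    worker: Callable[[], None],
    max_restarts: int,
    on_crash: Callable[[Exception], None] = lambda exc: None,
) -> int:
    """Run worker, restarting it after each crash.

    Returns the number of restarts used. Re-raises the last exception
    once the restart budget is spent, so the failure escalates upward.
    """
    restarts = 0
    while True:
        try:
            worker()
            return restarts
        except Exception as exc:
            on_crash(exc)              # log or alert; never swallow silently
            if restarts >= max_restarts:
                raise                  # budget spent: let the parent decide
            restarts += 1
```

Real supervisors add backoff, isolation between workers, and a hierarchy of parents, but the recovery path is designed in from the start in the same way.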

It wasn't a rhetorical question. What could this program possibly do to recover?

If the CPU fails, no one would say the program was unreliable.

In this case, the disk failed, because the antivirus unplugged it. Was the program unreliable?

The disk did not fail. An I/O operation failed. The first is a permanent condition, the second can be permanent or transient. Big difference.

In Linux, a signal can cause an I/O to fail. In Windows, antivirus and other background tasks can cause I/O to fail.

What can it do to recover? Retry the I/O operation! It should keep trying until the operator tells it to stop.
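
Roughly what that retry loop could look like, as a Python sketch. `read_with_retry`, `read_once`, `stop` and `on_error` are illustrative names, and the 50 ms delay is an arbitrary assumption, not a recommendation for a real device.

```python
import threading
from typing import Callable, Optional


def read_with_retry(
    read_once: Callable[[], bytes],
    stop: threading.Event,
    delay_s: float = 0.05,
    on_error: Callable[[OSError], None] = lambda exc: None,
) -> Optional[bytes]:
    """Retry read_once until it succeeds or the operator sets `stop`."""
    while not stop.is_set():
        try:
            return read_once()
        except OSError as exc:
            on_error(exc)        # surface the failure to the operator
            stop.wait(delay_s)   # pause, but wake early if told to stop
    return None                  # operator told us to give up
```

The `on_error` hook is where the "can't read file" message would be raised to the screen, so the operator sees the problem while the retries continue.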

In that scenario, your surgeon would see the program suddenly freeze.

The program likely looks like this: data acquisition -> transformation -> display transformation on monitor.

If the transformation step fails, the monitor will end up displaying (a) nothing, (b) random data, or (c) the most recent image. None of these help the surgeon continue surgery. It's the same as a crash.

If your environment fails, there's nothing you can do to recover. Planes aren't designed to survive the loss of a wing. Why is this case any different?

> In that scenario, your surgeon would see the program suddenly freeze.

Only if the programmer or his management were incompetent. The display routine should be running on a separate thread from the processing code. No whole program freeze should occur.

As for displaying random data, why would the programmer want to do this? Either display nothing or the last readings WITH a message that it's not real time.

It's not the same as a crash! A crash requires 5 minutes minimum guaranteed. Restarting instantly after telemetry returns can happen in under a second in the best case, which can be the difference between a live and a dead patient.
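
A sketch of the separate-display-thread idea in Python, assuming a single shared "latest reading" that the processing thread publishes and the display thread polls. `LatestReading`, `render` and the 0.5 second staleness threshold are all invented for the example.

```python
import threading
import time
from typing import Optional, Tuple


class LatestReading:
    """Thread-safe holder for the most recent reading and its timestamp."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value: Optional[float] = None
        self._stamp: float = 0.0

    def publish(self, value: float, now: Optional[float] = None) -> None:
        # Called by the processing thread whenever new data arrives.
        with self._lock:
            self._value = value
            self._stamp = time.monotonic() if now is None else now

    def snapshot(self) -> Tuple[Optional[float], float]:
        with self._lock:
            return self._value, self._stamp


def render(latest: LatestReading, max_age_s: float = 0.5,
           now: Optional[float] = None) -> str:
    """Called by the display thread; it never waits on disk I/O."""
    value, stamp = latest.snapshot()
    now = time.monotonic() if now is None else now
    if value is None:
        return "NO DATA YET"
    age = now - stamp
    if age > max_age_s:
        return f"{value:.1f}  ** NOT REAL TIME: last update {age:.1f}s ago **"
    return f"{value:.1f}"
```

The display thread would call `render` on a timer. If the processing thread stalls on I/O, the screen keeps updating and the stale warning appears by itself.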

> If your environment fails, there's nothing you can do to recover. Planes aren't designed to survive the loss of a wing. Why is this case any different?

There are different kinds of failure. Permanent and transient. Following the permanent procedure for a transient case can be fatal.

Take your airplane example. Loss of a wing is permanent. That would be like the CPU failing or an external cable being cut.

But your engines shutting down can be permanent or transient. Just like disk I/O failing. You'd use the transient procedure in this case. Keep trying to restart the engines. If they restart, great! You've just saved the plane.

Same with the disk I/O. The programmer should keep trying to restart the I/O. If it comes back, great! You've just saved the patient.

Definitely. Each component should do its best to keep on keeping on. The display program should keep displaying something, even if it's just the most recent data with a big "connection lost" warning. The device should ring-buffer the data and upon reconnection the screen should show as much as possible. The OS should have a strong opinion that the surgery app is very important, and that should the app fail, it should be restarted instantly.
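
The ring-buffer part is cheap to do. A minimal Python sketch, using `collections.deque` with `maxlen` as the ring; `TelemetryBuffer`, `record` and `drain` are hypothetical names, and a real device would size the capacity from its sample rate and the longest outage it wants to cover.

```python
from collections import deque
from typing import Deque, List


class TelemetryBuffer:
    """Keeps the newest `capacity` samples; older ones are overwritten."""

    def __init__(self, capacity: int) -> None:
        self._samples: Deque[float] = deque(maxlen=capacity)

    def record(self, sample: float) -> None:
        # Always succeeds, even while the link to the display is down.
        self._samples.append(sample)

    def drain(self) -> List[float]:
        """On reconnection, hand over everything still held, oldest first."""
        backlog = list(self._samples)
        self._samples.clear()
        return backlog
```

If the outage outlasts the buffer, the oldest samples are lost, but the most recent window is still there when the screen comes back.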

Moreover, this is the kind of thing that should come up in robustness testing. Things should get bumped and wiggled. They should get unplugged and turned off. If the software is really going to run on random Windows boxes, then it should be tested on random Windows boxes. (At which point somebody will hopefully say, "Wow, this sucks, let's make it an appliance.")

No matter what happens, it shouldn't result in a "mysterious crash right in the middle of a heart procedure when the screen went black and doctors had to reboot their computer".

I had to step away from this conversation because of how aggressive you were being. Now that no one is watching, we might try to have a productive conversation.

Please consider dropping the adversarial attitude. This place isn't like other sites. The way people converse is equally important to what they say. It's better to transcend than to dominate.

For example, we do not slip in underhanded comments like this:

> In that scenario, your surgeon would see the program suddenly freeze.

Only if the programmer or his management were incompetent

This is just short of a personal attack, which is against the rules. I know you probably didn't mean it that way, but look at how you're framing the debate. I felt as if I'd been teleported onto Fox News and forced to defend myself from an aggressive interviewer's mischaracterizations.

Now, you can take the stance that "It's not against the rules, so I can say whatever I want." That's true, you can. But we're worse off for it. We optimize for good conversation here.

The point I'm trying to get across is that if you really throw yourself into this community, wholeheartedly and without a feeling of having to prove something wherever you go, then this place has a lot to offer. You'll meet a lot of interesting people, you'll hear a lot of interesting stories, and perhaps you'll have an opportunity to contribute to something quite unexpected. But none of that will happen if you try to skewer your opponents wherever you go -- or if you see people here as opponents. We're people.

It doesn't matter what the conversation is. It doesn't matter whether it's about life-or-death, or that this one happened to be about a surgery. The goal is to put yourself in the other person's shoes and to ask yourself, "If I were them, why would I say that?"

Regarding our conversation, if you want to continue it, I'd be happy to. But unless you're trying to learn as much from me as I'm trying to learn from you, it's not going to go anywhere productive. And what would be the point? No one's looking anymore -- it's fallen off the front page, so it's just you and me here. But why should our conversation be so different just because nobody is watching?

There are things to be said, but I have no time to defend myself. You can characterize what I was saying however you want. Or, alternatively, you could ask me what I meant.

I won't pull one of those "I've been in the field for a pretty long time, so I bet you'll learn something..." routines. Those are tired refrains, usually coming from people who have long forgotten what it's like to be young and hungry. But I'm still pretty young, and money's low enough that I'm pretty hungry. Being unable to afford meat is unfortunate, but it's worth not having a job for a little while to throw myself into my research. See why there's no time to defend against aggression?

I think I wrote this because in many ways, you remind me of how I used to be. And if I could go back in time, I'd ask myself what I was doing and why. This type of discourse is an intellectual dead-end. No one is going to learn a thing from watching people try to tear each other apart. Maybe you didn't realize that's what you were doing. It's very easy to slip into that mindset without realizing it.

> As for displaying random data, why would the programmer want to do this?

GPUs are bastards. They ignore what programmers want, almost by definition. And as someone who has spent way-too-many years wringing as much performance as possible from them, I assure you that this is a realistic characterization of a possible outcome.

Perhaps that piques your curiosity. If so, then that sounds like the start of a good conversation, no?

Some aircraft have landed without all their wings.

Fail fast is fine for a dating app; it's not acceptable for an antilock braking system. As to disk IO, they should have kept a redundant disk for backup just in case. Remember, a program can generally spend 0.05 seconds waiting and it's no big deal. A program that takes 300 seconds to reboot is far worse.

Normal software does not crash the entire OS because the antivirus was looking at a file it wanted. We expect better of text editors, for God's sake. There is absolutely no excuse for a life-critical system to fail to meet that rock-bottom standard.

Software should absolutely try to account for these kinds of errors and routinely does. When was the last time you saw a word processor or spreadsheet bork a machine so hard it needed to be restarted just because an AV scan kicked in? ReadFile and ReadFileEx both hand you a specific error if some part of the file you are trying to read is locked by another process, because it's hardly rare. 'Halt and catch fire' is not generally considered a proportionate response. I've no idea where you're getting this 'unplugged the disk' thing from; AV software does not work this way.
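
As a rough illustration of acting on that specific error, here is a Python sketch that sorts I/O failures into "worth retrying" and "not". The two Win32 codes are the standard values from winerror.h; which codes count as transient is my own assumed policy, and `is_transient` is a made-up helper, not part of any library.

```python
import errno

# Win32 error codes as defined in winerror.h.
ERROR_SHARING_VIOLATION = 32
ERROR_LOCK_VIOLATION = 33

# Assumed policy: these failures usually clear up without intervention.
TRANSIENT_WINERRORS = {ERROR_SHARING_VIOLATION, ERROR_LOCK_VIOLATION}
TRANSIENT_ERRNOS = {errno.EINTR, errno.EAGAIN, errno.EBUSY}


def is_transient(exc: OSError) -> bool:
    """Guess whether an I/O failure is worth retrying."""
    # OSError.winerror is only populated on Windows.
    winerror = getattr(exc, "winerror", None)
    if winerror in TRANSIENT_WINERRORS:
        return True
    return exc.errno in TRANSIENT_ERRNOS
```

A caller would retry when this returns True and report a hard failure otherwise, instead of treating every error as fatal.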