| Software shouldn't necessarily try to account for errors in that manner. Usually, the most graceful thing to do is to exit cleanly. For example, if there is a massive amount of data, it has to be stored on disk. It's too large to keep in memory. And if the point of the program is to transform that data in real time, then it has to have access to the disk. The antivirus basically unplugged the disk. What can it do to recover? There's nothing to be done. It should be able to survive that situation, of course. When the disk is plugged back in, it should be able to restart without any problems. But I think that's a different kind of resiliency than what you're referring to. In this case, the only way to recover would be to copy the frozen data to a new area of the hard drive, assuming it retained read access. But such complexities result in brittle implementations, prone to acquiring bugs. What if the disk space runs out? So you check beforehand whether there's enough space. But what if some other program starts consuming disk space in the middle of your copy operation? And so on. It's an endless spiral of design complexity. The situation in the article seems closer to hardware failure than a design oversight. |
Yes and no. I was referring to restarting internally when the error condition went away but restarting the app and waiting for telemetry to return can be a valid solution.
Think of your torrent software. If you crank your firewall to block it while it's running it will not crash. If your disk fills up it won't crash. When the network comes back or more drive space if freed it will restart it's internal mechanisms. You wouldn't want it to restart in these conditions. If it runs out of memory however choosing to exit might be the best recovery mechanism.
I think a life critical medical application can at least strive for internal restart and do an external restart if all else failed. The article stated they had to reboot the machine to get it back. Now that's way worse.
> The situation in the article seems closer to hardware failure than a design oversight.
Hardware failure is almost always a permanent condition. This was a "my I/O stopped briefly and would have came back if my code could handle it".