by Avernar 3700 days ago
"Merge says the antivirus froze access to crucial data acquired during the heart catheterization. Unable to access real-time data, the app crashed spectacularly.

The company claims that they included proper instructions in their documentation, advising companies to whitelist Merge Hemo's folders in order to prevent crashes from happening, so it seems that the whole incident was nothing more than an oversight on the medical unit's side."

Here's how I read that: The programmers of this piece of software assumed that some I/O operation would never fail and when it does the program shits itself. So instead of hardening their software to withstand loss of telemetry gracefully, which would cost time and money for the company, they just give instructions to disable scans on their folder.

Odds are good that somewhere this scan will happen (and it did). Either IT doesn't read the release notes, or goofs the configuration, or an antivirus update clears the whitelist. It might not even be the antivirus that briefly interferes with the telemetry.

But instead of having resilient software it's "the antivirus software's fault" or "it's IT's fault" when something goes wrong because of their bad management/engineering decision.


Exactly this. As I was reading the article I hoped to find this exact point in the HN comments.

The fault lies in the bad software. It could have been the indexing service, online defrag, automatic updates, or any of the other various background processes Windows runs.

If it is critical software, it should be designed in a way to not fail when something non-critical malfunctions, and even the critical pieces should be built with redundancy.

I work for a medical devices company and I just want to say: We, specifically a few of us on the engineering staff, bring this sort of shit up constantly. I go hoarse having the same conversations over and over and over again about robustness in the face of failure, resiliency, redundancy, etc... The truth is that we're beholden to a board and an executive management team that, quite simply, doesn't give a fuck about our problems.

I'm not trying to excuse the company in the article or the company that I work for. And I do not work for the company in the article. I just wanted to point out that I do see how this can happen very easily and repeatedly.

I'm just curious. I work in the automotive sector and develop hardware and software using components that are advertised as functionally safe. I use hardened RTOSes from vendors who claim their RTOSes are in medical devices as well as military systems.

One such system is Disti (http://www.disti.com/)

In the automotive field, our software is MISRA compliant, static analysis is done (Klocwork - http://www.klocwork.com/) and we follow a very strict set of guidelines outlined in ISO 25119 and ISO 26262 for the construction and agricultural markets. Think self-driving tractors and combines. For example: a tractor traveling down a field with a combine following it a few rows over, separating and chopping things into a catcher, all done with one person driving.

This shit can't happen where I work. Every component on our circuit boards has an MTTFd of 40 years. Hardware watchdogs can kill the system if software goes awry.

Software is written to readiness levels called SRL-1, SRL-2, etc... Unit tests, peer reviews, etc... Functional safety in medical devices is covered under 510(k) (http://www.fda.gov/MedicalDevices/ProductsandMedicalProcedur...)

I find it amazingly shortsighted that antivirus software is even allowed on a medical device to begin with. I can't imagine how this system passed even the easiest audit for software readiness.

How is it you "go hoarse having the same conversations?" Do you not have to meet FDA compliance criteria? Are you in the US?

I didn't read the whole article so I'm assuming this happened in the US. For me, we sell autonomous vehicles in the European markets, where functional safety seems to be a bit more aggressive right now for vehicles. Not sure about medical devices.

> I find it amazingly shortsighted that antivirus software is even allowed on a medical device to begin with.

Well... Then you should consider yourself blessed to have never had to deal with the bureaucracy of a hospital IT department and administrative staff.

Who owns the medical device? Who paid for it? If it's a glorified Windows machine and it's attaching itself to a hospital's Wi-Fi network... Who has to use this machine? Physicians, surgeons, anesthesiologists, radiologists, other specialists, nurses, staff? All of them need to be trained on its usage, no doubt. They don't get that training in schooling. Who provides it? This and a million other things stack up. So, well, I mean it can start to make sense how these things end up with random AV software installed on them, right?

> How is it you "go hoarse having the same conversations?" Do you not have to meet FDA compliance criteria? Are you in the US?

Yes, we are. Yes, we do "have to meet FDA compliance." I can't define "have to meet" and I work here. Of course, I'm just an engineer. We have legal, executive, and other staff for those matters. I'm sorry, I'm not trying to be an asshole... I'm just trying to be honest about where I find myself in this situation.

Sounds like an awesome job with good engineers but neglectful and irresponsible management.

If you are only making these warnings verbally, you might want to consider emailing your immediate manager with a list of concerns. Make it as neutral as possible and ask for guidance on how they want to address the issues. But if it's on the mail server, it will be good for discovery if the worst happens, and frankly given lives are at stake you probably need to show, in writing, that you were attempting to have the issues addressed.

Who knows? That might actually get traction. Might even save someone's life!

> Yes, we do "have to meet FDA compliance." I can't define "have to meet" and I work here. Of course, I'm just an engineer.

You are not an engineer. This is a protected term in the US and other countries. If you were a professional engineer, you would be bound by a legal and moral framework preventing you from doing work on unsafe medical equipment.

There is a good argument that there should be a software equivalent of protected engineer status for this kind of work. This kind of story should be a wake-up call. I personally had no idea that critical medical equipment would be running on MS Windows...

Engineer alone is not a protected term in the US. "Professional Engineer" is.

As of 2012 you can take the PE Exam for Software Engineering [1].

[1]: http://ncees.org/about-ncees/news/ncees-introduces-pe-exam-f...

> How is it you "go hoarse having the same conversations?" Do you not have to meet FDA compliance criteria? Are you in the US?

You have to deal with FDA pretty much regardless where you're based, if you want any kind of market for your medical device. A lot of countries define compliance as whatever is good enough for FDA.

FDA rules for software... what FDA wants is a paper trail.

I'm with you...I can see exactly how this can happen.

Unfortunately the only thing that can solve the apathetic board and executive management problem (who only see dollar signs) is the actuality, or realistic possibility, of significant financial loss, or loss of their personal freedom (prison) due to the negligence of the system. And a $10 Mil fine for a fault in something that you make $100 Mil off of is not significant. That's $90 Mil profit in their eyes. And they probably get to write it off.

Even more unfortunate, is that, in the situation that this happens, the "engineers responsible" will be fired, and the executives will resign with a nice golden parachute, and go on to do the same thing somewhere else.

But then you have the company that does do it right, spend the time, and the money to make a truly redundant, fault-tolerant system. But, they come in at a price point 20% higher than their competitor, who doesn't. Which company survives and which doesn't?

Sad, but, unfortunately the way it is. I don't know a practical solution either.

I've thought about this a lot. I've had private conversations with the CEO which led me to believe that their apathy is a, if not the, primary driver in this situation, at least within the company. Ultimately, they are the single individual who can force these changes in the departments. As things stand today, as far as I can tell, the CEO and the rest of the executive team got theirs and that's that. Anything extra is just that, extra.

We've been close to undergoing "major" scrutiny (as it was sold to me, it was A Big Deal) from the FDA before. I, personally, just a lowly and underpaid engineer, have saved executive staff from having to sign their names on that noose. I had a manager once who seemed to want to push it that far, to stand idly by while the walls fell down around us. I, unknowingly at the time, prevented it from happening because I was trying to help our customers. I don't regret that decision; actual patients shouldn't have to suffer because of a management team's ineptitude. I do think about it often, though. I understand this is nebulous, and I'm sorry for that. This is a reality, though.

I guess that's the thing that really gets me, the FDA. We sell FDA approved devices. Where the fuck is the FDA? We send them paperwork and they are happy. I can only form the opinion that they, the FDA, are ill-prepared to handle this situation: the actual situation, the "the medical devices industry is a fucking train wreck waiting to happen" situation, and they are especially ill-prepared to handle it at scale. Audits are cursory and almost as a rule non-technical. I suppose it'll take a Toyota-level incident to bring change about.

Along the same lines as your 'where the fuck is the FDA' comment -- I've worked in Financial and Healthcare systems on and off for about the last 10 years.

I have seen SSAE16 audited companies that haven't patched anything in years. FDIC examined institutions with ATM machines still running OS/2 Warp (actually probably more secure than the ones running XP, with no updates installed. Ever.)

I once found the management interface of a SAN with a public IP address directly on the device, no firewall rules of any sort, and the device still had the default username/password. It hadn't been patched or rebooted in over 2 years.

More shocking is that a review of the logs didn't show any successful unauthorized logins. Of course, they could have cleaned up after themselves, but further investigation was outside the scope of my engagement. (They didn't want to know. They were happy to present that, despite the oversight, there was no indication that PHI had been accessed by unauthorized people. Their conclusion, not mine.)

I can't help responding again. If you have tangible evidence of neglect or regulatory non-compliance, or even risks that are known about but not being dealt with by management, have you considered compiling this material and reporting it to the FDA?

But as I've said before - I really hope you have written down your concerns to someone in management. If it gets to the point where negligence takes out the company, there's going to be an attempt to make someone a scapegoat. Depending on your role in the company you don't want to be held personally liable for the incompetence and ruthlessness of management...

>Where the fuck is the FDA? We send them paperwork and they are happy.

When regulation becomes more about permission than proficiency, you'll get corruption instead of competence.

> Unfortunately the only thing that can solve the apathetic board and executive management problem (who only see dollar signs) is the actuality, or realistic possibility, of significant financial loss, or loss of their personal freedom (prison) due to the negligence of the system.

Or developers refuse to build software without safety built in.

If they can't hire anyone to build their unsafe systems, they'll have to start building safe software.

Let the market work for you.

That sounds nice...but then you will be replaced by a developer that will toe the company line. You're making 'unreasonable' demands and holding up progress. 'We can fix that with version 2.0'

If every developer on the planet suddenly had a pang of conscience, then something like this would work.

Fortunately I have never found myself in such a position, but I have seen it many many times.

That's why we should probably require engineering certifications for working on safety-critical software. Working on such software should require demonstrating a certain level of knowledge and upholding a code of ethics.

I generally oppose certification for engineers, but solving collective action dilemmas like this and saving lives in the process is exactly where it would help.

> get to write it off

A fine being tax deductible does not mean zero cost to the company; it means the profit is reduced before taxes are computed, i.e. the actual cost is reduced by the marginal tax rate. A tax credit means zero cost.
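
To put numbers on that distinction, here is a small Python sketch. The 35% marginal rate is an assumption picked only for illustration, and `net_cost` is a hypothetical helper; the $10 Mil fine is the figure used earlier in the thread.

```python
def net_cost(fine: float, marginal_rate: float, treatment: str) -> float:
    """Return what a fine really costs the company under a tax treatment.

    treatment is one of:
      "none"      - not deductible, so the full amount is the cost
      "deduction" - reduces taxable profit, so the cost shrinks by the
                    marginal rate
      "credit"    - reduces the tax bill dollar for dollar, so net cost is zero
    """
    if treatment == "none":
        return fine
    if treatment == "deduction":
        return fine - fine * marginal_rate
    if treatment == "credit":
        return 0.0
    raise ValueError(f"unknown treatment: {treatment!r}")


if __name__ == "__main__":
    fine = 10_000_000  # the $10 Mil fine from the thread
    rate = 0.35        # assumed marginal tax rate, illustration only
    for treatment in ("none", "deduction", "credit"):
        print(treatment, round(net_cost(fine, rate, treatment)))
```

Under that assumed rate, a deductible $10 Mil fine still costs the company about $6.5 Mil.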

It's not a decision which should be made at the level of executives though.

Presumably developers are the ones estimating how long things take. (If they're not, you have even bigger problems and I'm sorry.) The time to make it safe should automatically be included in those estimates.

Moreover, making it safe shouldn't be a separate part of the process. It should just be part of how you write software. It's either safe or it doesn't exist at all. (Compare this to how organizations like Google deal with concurrency: it's built in from the start.)

A reputable engineer wouldn't design and build a bridge which might collapse. A developer shouldn't build software which puts lives at risk, regardless of management pressure.

If they refuse to relent, there are plenty of jobs where safety isn't critical.

> Presumably developers are the ones estimating how long things take.

This is not meant as a slight: I think you're grossly unfamiliar with software development outside of engineering-driven companies.

It's pretty much a guarantee that product managers are deciding these estimates. They might confirm with the developers, but the conversation probably went something like this:

"Does 3 weeks sound about right for this?"

"No, we'll need 6"

"Why?"

"Safety checks"

"Ok, we don't have 6 weeks. I can give you 4, but we're just gonna have to make do."

Is it scary that conversation happened about a piece of medical software? Absolutely. Would I bet $1k that it happens frequently? Absolutely.

> A reputable engineer wouldn't design and build a bridge which might collapse

Rarely does a single engineer design a bridge nowadays, so corporate liability and reputation (good luck landing more contracts if your bridge collapses) is a huge factor in much of that beyond simple ethics.

I would be shocked if anything happened to Merge as a result of this, whereas a company who designed a faulty bridge would be sued into oblivion.

Further, professional engineering in the US is a whole different game that involves licensing and regulations specifically to avoid that situation. Software "engineering" has no such equivalent currently.

Pinning the blame on the peons is a sure-fire way to make sure this situation never changes.

Oh, I'm well aware of the difficulty of negotiating with product managers over timelines.

The difference is that they never should get the decision to cut safety checks. Cutting safety checks should be as ludicrous/impossible as writing half the code of each function to cut time.

The conversation should go like this:

PM: "Does 3 weeks sound about right for this?"

Dev: "No, we'll need 6"

PM: "Why?"

Dev: "That's how long it takes to build those 6 features."

PM: "Ok, we don't have 6 weeks. I can give you 4, but we're just gonna have to make do."

Dev: "Okay, which features would you like to cut?"

> Further, professional engineering in the US is a whole different game that involves licensing and regulations specifically to avoid that situation.

I'm aware. While I don't think the majority of software developers should be certified, we should require licensing for working on safety-critical applications.

> The conversation should go like this:

I think you're missing the end to that conversation:

>PM: "Ok, we don't have 6 weeks. I can give you 4, but we're just gonna have to make do."

> Dev: "Okay, which features would you like to cut?"

PM: We can't cut any of them. We need features A,B,C in the product and we need it in 4 weeks.

Here we insert a rant from the PM about one of the following:

1) Leadership

2) Hard work

3) Threats about job security

4) Recalling that one time you delivered something ahead of schedule so why is this different

5) I see you getting up to get coffee at least twice a day so stop goofing off and get it done

I think you're vastly overestimating how much power/control said Dev has over the whole process at these sorts of companies.

Sure, they can quit, but if they felt empowered to quit they probably wouldn't be there in the first place: I don't think anyone's busting down the door to work at MedicalBusinessTM.

> we should require licensing for working on safety-critical applications.

Fully agreed, though with some misgivings.

Incorporating safety-critical software into the "professional engineering" spectrum would almost certainly require some things that are seen as near-heresy to the software community, like requiring a 4-year degree from an ABET-accredited program.

Still, I agree.

I agree with your sentiment 100%, but if you insist on doing things safely while your colleagues do not, you might get a reputation for being slow and be earmarked for replacement. Perhaps it's worth losing a job over, but your replacement will cut corners so the net effect is patients unsafe + you have no job. It feels reminiscent of the prisoner's dilemma.

This. I'm in the same situation.

There are lots of faults. The software failed. The process that directed a helpdesk tech to install AV was a failure of some manager. The decision to engineer systems and networks in a way such that AV seemed like a good idea was a failure of an architect.

In my opinion, the software failed because the entire system (software, hardware, and humanware) failed to implement, holistically, a safety critical system. You simply cannot ignore the system as a whole. I'd wager we are in violent agreement. :)

> The decision to engineer systems and networks in a way such that AV seemed like a good idea was a failure of an architect.

As ever, a relevant xkcd: https://xkcd.com/463/

I build software that does the exact same thing. We're running automotive tests, and our management/customers are unwilling to invest in solutions that will work in spite of the fact that Windows is not a real-time OS.

We use a National Instruments DAQ card, and need the PC to respond within 50 ms to issue new commands for hours or days. Remarkably, it usually (over hundreds of machines and decades of operation) does. When it doesn't, it's blamed on antivirus or firewall or technicians using the PC for other things while the software runs.

National Instruments provides real-time IO systems, but they cost a lot more than the basic systems. You can write driver-layer code that will run in real-time on Windows, but that takes longer.

Our customers and management, with varying levels of comprehension of the problem, elect to not spend that money. I hate to say it, but if we didn't make this compromise, there are competitors who would.

> We use a National Instruments DAQ card, and need the PC to respond within 50 ms to issue new commands for hours or days. Remarkably, it usually (over hundreds of machines and decades of operation) does. When it doesn't, it's blamed on antivirus or firewall or technicians using the PC for other things while the software runs.

It works as long as the full code path and the data it requires are not paged out. Or some other thread doesn't consume I/O resources, etc.

In other words, it's not guaranteed at all.

The only way to get Windows to react reliably within 50 ms is in a kernel driver, as a response to an IRQ. There's considerable jitter even in IRQ handling, but usually worst case service times are 200-500 microseconds. It depends a lot on other devices and on your IRQ priority. It's worse for passive level drivers (IRQL == 0).

A guaranteed 50 ms response time requires that the code and data be in the non-paged pool.
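
None of this can be fixed from user-mode code, but the misses can at least be measured instead of blamed on the antivirus after the fact. A minimal Python sketch, assuming the control loop records a `time.monotonic()` timestamp each time it issues a command; `find_deadline_misses` is a hypothetical helper, and only the 50 ms figure comes from the thread.

```python
from typing import List, Tuple


def find_deadline_misses(
    timestamps_s: List[float], deadline_s: float = 0.050
) -> List[Tuple[int, float]]:
    """Return (index, gap) for every loop iteration that ran late.

    timestamps_s holds one monotonic timestamp per issued command, in
    seconds. An iteration is late when the gap since the previous
    command exceeds deadline_s.
    """
    misses = []
    for i in range(1, len(timestamps_s)):
        gap = timestamps_s[i] - timestamps_s[i - 1]
        if gap > deadline_s:
            misses.append((i, gap))
    return misses
```

This does not make Windows real-time; it only turns "it usually works" into a logged count of how often it did not.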

Software shouldn't necessarily try to account for errors in that manner. Usually, the most graceful thing to do is to exit cleanly.

For example, if there is a massive amount of data, it has to be stored on disk. It's too large to keep in memory. And if the point of the program is to transform that data in real time, then it has to have access to the disk.

The antivirus basically unplugged the disk. What can it do to recover? There's nothing to be done.

It should be able to survive that situation, of course. When the disk is plugged back in, it should be able to restart without any problems. But I think that's a different kind of resiliency than what you're referring to.

In this case, the only way to recover would be to copy the frozen data to a new area of the hard drive, assuming it retained read access. But such complexities result in brittle implementations, prone to acquiring bugs. What if the disk space runs out? So you check beforehand whether there's enough space. But what if some other program starts consuming disk space in the middle of your copy operation? And so on. It's an endless spiral of design complexity.

The situation in the article seems closer to hardware failure than a design oversight.

> When the disk is plugged back in, it should be able to restart without any problems. But I think that's a different kind of resiliency than what you're referring to.

Yes and no. I was referring to restarting internally when the error condition went away but restarting the app and waiting for telemetry to return can be a valid solution.

Think of your torrent software. If you crank your firewall to block it while it's running it will not crash. If your disk fills up it won't crash. When the network comes back or more drive space is freed it will restart its internal mechanisms. You wouldn't want it to restart in these conditions. If it runs out of memory, however, choosing to exit might be the best recovery mechanism.

I think a life critical medical application can at least strive for internal restart and do an external restart if all else failed. The article stated they had to reboot the machine to get it back. Now that's way worse.

> The situation in the article seems closer to hardware failure than a design oversight.

Hardware failure is almost always a permanent condition. This was a "my I/O stopped briefly and would have come back if my code could handle it".

During a surgery, the program doesn't have the luxury of showing a screen that says "No telemetry available." Such a program would be considered equally unreliable. Worse, it would lead to confusion: "Why is the telemetry unavailable? What does 'Error Code 2931' mean?"

A spectacular crash immediately led to pinpointing the problem: The antivirus.

If the program's sole purpose is to transform a massive amount of data in real time, it must have disk access by definition. It can't not have disk access. What would you suggest it do?

Yes it does! Showing "no telemetry available" is exactly what it should do. Crashing = unreliable. Reporting an error condition = reliable.

Immediately? Took them 5 minutes to reboot the computer. The scan of the folder would take seconds let alone minutes. Pinpointing the problem is secondary. Not killing the patient is primary.

> If the program's sole purpose is to transform a massive amount of data in real time, it must have disk access by definition. It can't not have disk access.

And that is the mindset the programmers of the software had. You have to take care of error conditions. The processing can't work without disk access, but loss of disk access can occur temporarily or permanently. What can you do? Pause the processing part of your program. Or make the processing part treat "no data" as valid input and display something else.
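
A minimal Python sketch of the second option, treating "no data" as valid input. Everything here is hypothetical (the `Frame` type, the function names, the text file format, the mean-removal "transform"); it is not how Merge Hemo works, only an illustration of the shape of the idea.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    """What the display should show for one update."""
    values: Optional[List[float]]  # None means "no data", not "zero"
    banner: str                    # explicit status text, never left blank


def read_samples(path: str) -> Optional[List[float]]:
    """Return the samples in `path`, or None if they cannot be read now."""
    try:
        with open(path, "r") as f:
            return [float(line) for line in f if line.strip()]
    except (OSError, ValueError):
        # Locked, missing, or half-written file: report "no data" to the
        # caller instead of letting the exception end the program.
        return None


def process(samples: Optional[List[float]]) -> Frame:
    """Transform samples for display, treating 'no data' as a valid input."""
    if not samples:
        return Frame(values=None, banner="NO TELEMETRY AVAILABLE")
    mean = sum(samples) / len(samples)
    # Stand-in transform: remove the mean from each sample.
    return Frame(values=[s - mean for s in samples], banner="LIVE")
```

The display code then always receives a `Frame`, so a failed read changes what is shown rather than whether the program keeps running.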

Imagine taking that viewpoint with an ECG machine: This machine displays a heart rate waveform. So it must have a heart rate input. If there is no heart rate we'll just crash requiring a 5 minute reboot.

Hell no! Draw a straight line and set off a buzzer!

I agree with you, but the flat line might not be the best example because that has a very specific meaning (asystole) that doctors will take certain actions based on without necessarily trying to verify it manually when time is already critical. You should never be able to confuse an error message for anything else.

> You should never be able to confuse an error message for anything else.

Exactly. Which is why "Can't read file sensors.dat" is way better than just crashing. Crashing is one of the worst error messages you can get because you don't know what happened.

" Avernar 18 hours ago

Yes it does! Showing "no telemetry available" is exactly what it should do. Crashing = unreliable. Reporting an error condition = reliable.

Immediately? Took them 5 minutes to reboot the computer. The scan of the folder would take seconds let alone minutes. Pinpointing the problem is secondary. Not killing the patient is primary."

Well put. As far back as the Burroughs B5000, the best way to handle erroneous software or I/O was to freeze it, notify the administrator/user of the problem, and give them sensible options for how to proceed. They might restart the I/O, restart the app, modify erroneous data to proceed (rare here), and so on. Crash and reboot is a Windows 95/NT strategy where incompetence dominated. Today's Windows OS and tooling can do much better with little effort by developers.

But the spectacular crash didn't pinpoint the problem. That came afterwards, when the manufacturer was able to look into the crash.

In a situation like this, confusion is all but inevitable. As a developer, the goal should be to minimize that confusion to the greatest extent possible. A blank screen and crash introduces another step to the process as people wonder "what's going on?" instead of "shit, it threw an error." It's probably not a big deal, but with medical devices during surgery, that extra step could be hugely problematic.

I wonder what surgeons think about software engineers? Except for open heart surgery, they don't normally do their fixes by stopping and starting the thing they are repairing...

I totally disagree.

Sure, usually the most graceful thing to do is exit and hope a human fixes it. But that's usual because the usual condition is that sudden failure is NBD and a human is right there to screw with it.

That's becoming less common, though. When software was mostly something running on a PC doing some boring office task, reliability didn't matter. But as software is running our airplanes, our cars, our medical devices, and even, as with implanted pacemakers and insulin pumps, our bodies, then reliability goes from NBD to BFD.

We see the way forward with things like Chaos Monkey [1] and crash-only software [2] and the sort of design for failure you see in things like Agent supervisor hierarchies [3], where the way to reliability is through designing for failure recovery from the beginning and testing thoroughly to make sure it really happens.

[1] https://github.com/Netflix/SimianArmy/wiki/Chaos-Monkey

[2] https://en.wikipedia.org/wiki/Crash-only_software

[3] http://doc.akka.io/docs/akka/snapshot/scala/fault-tolerance....

It wasn't a rhetorical question. What could this program possibly do to recover?

If the CPU fails, no one would say the program was unreliable.

In this case, the disk failed, because the antivirus unplugged it. Was the program unreliable?

The disk did not fail. An I/O operation failed. The first is a permanent condition, the second can be permanent or transient. Big difference.

In Linux a signal can cause an I/O to fail. In Windows, antivirus and other background tasks can cause I/O to fail.

What can it do to recover? Retry the I/O operation! It should keep trying until the operator tells it to stop.
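
A minimal Python sketch of that retry loop. The three callbacks (`read_once`, `should_stop`, `on_error`) are hypothetical hooks, not part of any real product; the point is only that the loop keeps going until the read succeeds or the operator says stop, and that every failure is reported rather than swallowed.

```python
import time
from typing import Callable, Optional


def read_with_retry(
    read_once: Callable[[], bytes],
    should_stop: Callable[[], bool],
    on_error: Callable[[OSError, int], None],
    delay_s: float = 0.05,
) -> Optional[bytes]:
    """Retry a failing read until it succeeds or the operator stops it.

    read_once   performs one read attempt and raises OSError on failure
    should_stop returns True once the operator has asked to give up
    on_error    is told about each failure so the UI can show it
    Returns the data, or None if the operator stopped the retries.
    """
    attempt = 0
    while not should_stop():
        try:
            return read_once()
        except OSError as exc:
            attempt += 1
            on_error(exc, attempt)  # surface the problem, don't hide it
            time.sleep(delay_s)     # brief pause before trying again
    return None
```

Passing the stop condition in as a callback keeps the decision to give up with the operator rather than with a hard-coded retry count.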

In that scenario, your surgeon would see the program suddenly freeze.

The program likely looks like this: data acquisition -> transformation -> display transformation on monitor.

If the transformation step fails, the monitor will end up displaying (a) nothing, (b) random data, or (c) the most recent image. None of these help the surgeon continue surgery. It's the same as a crash.

If your environment fails, there's nothing you can do to recover. Planes aren't designed to survive the loss of a wing. Why is this case any different?

> In that scenario, your surgeon would see the program suddenly freeze.

Only if the programmer or his management were incompetent. The display routine should be running on a separate thread from the processing code. No whole program freeze should occur.

As for displaying random data, why would the programmer want to do this? Either display nothing or the last readings WITH a message that it's not real time.
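
A rough Python sketch of that split, assuming one acquisition thread calls `publish` and a separate display thread calls `render`. The `LatestReading` class, the 0.5 s staleness threshold, and the message wording are all made up for illustration.

```python
import threading
import time
from typing import Optional


class LatestReading:
    """Thread-safe holder shared by an acquisition and a display thread."""

    def __init__(self, stale_after_s: float = 0.5):
        self._lock = threading.Lock()
        self._value: Optional[float] = None
        self._updated_at: Optional[float] = None
        self._stale_after_s = stale_after_s

    def publish(self, value: float, now: Optional[float] = None) -> None:
        """Called by the acquisition thread whenever a new reading arrives."""
        with self._lock:
            self._value = value
            self._updated_at = time.monotonic() if now is None else now

    def render(self, now: Optional[float] = None) -> str:
        """Called by the display thread; never blocks on acquisition."""
        now = time.monotonic() if now is None else now
        with self._lock:
            value, updated_at = self._value, self._updated_at
        if value is None or updated_at is None:
            return "NO DATA YET"
        age = now - updated_at
        if age > self._stale_after_s:
            # Show the last reading, but say loudly that it is old.
            return f"{value:.1f} (NOT REAL TIME: last update {age:.1f}s ago)"
        return f"{value:.1f}"
```

If acquisition stalls, `render` keeps returning promptly and the stale banner appears on its own, so the screen neither freezes nor silently shows old data as live.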

It's not the same as a crash! A crash requires 5 minutes minimum guaranteed. Restarting instantly after telemetry returns can happen under a second in the best case which can be the difference between a live and dead patient.

> If your environment fails, there's nothing you can do to recover. Planes aren't designed to survive the loss of a wing. Why is this case any different?

There are different kinds of failure. Permanent and transient. Following the permanent procedure for a transient case can be fatal.

Take your airplane example. Loss of a wing is permanent. That would be like the CPU failing or an external cable being cut.

But your engines shutting down can be permanent or transient. Just like disk I/O failing. You'd use the transient procedure in this case. Keep trying to restart the engines. If they restart, great! You've just saved the plane.

Same with the disk I/O. The programmer should keep trying to restart the I/O. If it comes back, great! You've just saved the patient.

Some aircraft have landed without all their wings.

Fail fast is fine for a dating app; it's not acceptable for an antilock braking system. As to disk IO, they should have kept a redundant disk for backup just in case. Remember a program can generally spend 0.05 seconds waiting and no big deal. A program that takes 300 seconds to reboot is far worse.

Normal software does not crash the entire OS because the antivirus was looking at a file it wanted. We expect better of text editors, for God's sake. There is absolutely no excuse for a life-critical system to fail to meet that rock-bottom standard.

Software should absolutely try to account for these kinds of errors and routinely does. When was the last time you saw a word processor or spreadsheet bork a machine so hard it needed to be restarted just because an AV scan kicked in? ReadFile and ReadFileEx both hand you a specific error if some part of the file you are trying to read is locked by another process, because it's hardly rare. 'Halt and catch fire' is not generally considered a proportionate response. I've no idea where you're getting this 'unplugged the disk' thing from; AV software does not work this way.

You can't get enough upvotes. Crashing because an I/O operation fails? That sounds like simply a bug in the software. The developer didn't handle an error properly, and QA didn't test the software in an environment with elevated I/O activity. I've done enough code reviews over the years and seen enough ignoring errors from read(), ignoring malloc() returning null, not handling exceptions, etc. Good developers give a shit but many just don't care at all and think crashing or exiting when you're out of disk space is just fine.

Better to crash (and restart quickly into a known state) than to enter a rare, untested code path.

For software critical to human life, test the rare code paths.

The easiest way to be sure a code path will run properly is to avoid writing it in the first place. This kind of application should be designed to run in a highly linear, predictable fashion on robust, fault-tolerant hardware.

Why is nobody questioning the propriety of using an off-the-shelf Windows PC in safety-of-life applications?

> The easiest way to be sure a code path will run properly is to avoid writing it in the first place.

Agreed. So the app shouldn't contain anything extra not related to its primary function.

However, handling error conditions reported by the operating system trumps the extraneous code rule. But there are many ways to handle an error, including ignoring it if that's the proper thing to do.

Crashing is never the proper thing to do. If the program had simply exited, at the very minimum a restart would have taken a lot less time than a complete reboot of the machine. The software crashed so badly that it required a reboot of the machine.

> Why is nobody questioning the propriety of using an off-the-shelf Windows PC in safety-of-life applications?

They are, in the other threads. But using a better OS for the task wouldn't prevent the coding error the programmer made.

Let's say they chose Linux. A signal goes off or something else happens and their read call fails. Since they expect all their I/O to succeed they crash just like the Windows box.

If they bothered to handle the error and check for EINTR they'd know it was interrupted and not a hardware failure.
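
A short Python sketch of that check. Which errno values count as transient is an assumption made here for illustration; the right set depends on the application. Note also that Python 3.5+ already retries system calls interrupted by a signal (PEP 475), so `EINTR` rarely reaches Python code; in C you would test `errno == EINTR` yourself after `read()` returns -1. `classify_io_error` and `read_or_none` are hypothetical helper names.

```python
import errno
import os
from typing import Optional

# Assumption: this set is illustrative, not authoritative. Deciding which
# errors are worth retrying is a per-application judgment.
TRANSIENT_ERRNOS = frozenset({errno.EINTR, errno.EAGAIN, errno.EBUSY})


def classify_io_error(exc: OSError) -> str:
    """Label an I/O error as 'transient' (worth retrying) or 'permanent'."""
    if exc.errno in TRANSIENT_ERRNOS:
        return "transient"
    return "permanent"


def read_or_none(fd: int, size: int) -> Optional[bytes]:
    """Read from fd; return None on a transient error, re-raise otherwise."""
    try:
        return os.read(fd, size)
    except OSError as exc:
        if classify_io_error(exc) == "transient":
            return None  # caller may simply try again
        raise            # permanent failure: let a higher layer decide
```

Either way, the code looks at why the call failed before deciding what to do, instead of assuming the call cannot fail.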

My point is, changing operating systems doesn't protect you from poorly coded applications.

Agreed, I'm not blaming the OS, just saying it's the wrong tool for this sort of job.

> This kind of application should be designed to run in a highly linear, predictable fashion on robust, fault-tolerant hardware.

No, that's a recipe for failure. Hardware will fail. Period.

Hoping that nothing will fail and therefore not taking steps to mitigate it is akin to designing cars not to crash. [0]

Failures need to be explicitly designed for and tested. It's truly depressing that companies where failure is fine (ex. Netflix's Chaos Monkey) understand this, while companies where failure is deadly don't.

[0] https://news.ycombinator.com/item?id=11652940

Restarting quickly into a known state is not crashing. That is handling the error. How much of the program you restart is the question. Restarting just a thread is better than restarting the whole program.
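
A bare-bones Python sketch of restarting the smallest piece first and escalating only when that fails. `run_supervised` and `EscalateToFullRestart` are hypothetical names; real supervisor frameworks (such as the Akka hierarchies linked elsewhere in the thread) do far more. The sketch assumes `component` sets up its own state from scratch on every call, which is what makes a restart a return to a known state.

```python
from typing import Callable, List


class EscalateToFullRestart(Exception):
    """Raised when restarting the component alone did not help."""


def run_supervised(
    component: Callable[[], None], max_restarts: int, log: List[str]
) -> None:
    """Run component, restarting only it on failure, up to max_restarts.

    Each call to component() is assumed to begin from a known, freshly
    initialised state. If it keeps failing, escalate to the caller, which
    can restart a larger part of the program.
    """
    restarts = 0
    while True:
        try:
            component()
            return  # finished normally
        except Exception as exc:
            if restarts >= max_restarts:
                log.append("giving up, escalating")
                raise EscalateToFullRestart(str(exc)) from exc
            restarts += 1
            log.append(f"restart {restarts} after: {exc}")
```

A machine reboot then becomes the last rung of the ladder rather than the only one.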

But their app crashed. And hard. It required a machine reboot to restart. While it returned the machine to a known state it wasn't quick.

And in medical software, all code paths need to be tested.

File I/O error handlers should not be a rare untested code path.

Agreed. This is what happens when software fails catastrophically. People die and we lose confidence in software that could save lives if it worked properly.

You'd think it would take longer to write the instructions than it would to just throw a try-catch block in there.