I can get stories like call centers, but I absolutely don't understand how life critical systems aren't air gapped and rigidly controlled.
Fail safe is the only acceptable failure mode for any critical system. Crowdstrike failed here, but they're not the only thing that can go wrong with computers. Where is the redundancy?
Life-critical systems are air-gapped. Just no one considered systems running Epic to be life-critical. It turns out they are, probably more so than most.
Also, air-gapping helps only so much when the network dies and hospitals can't exchange patient information or send images from MRIs and X-rays to radiologists.
>and hospitals can't exchange patient information or send images from MRIs and X-rays to radiologists
My dentist literally took a photo of my x-ray with his phone and sent it to my orthodontist via WhatsApp and everything went quick and smooth, much faster than the official channels. Solutions to get a job done quickly and efficiently in case of emergency always exist, they're just not "by the book".
Imagine a news story about a dentist that violated HIPAA (or equivalent) laws because they used WhatsApp / Facebook to share medical records. Will this news story be about a hero vs someone who got into trouble?
HIPAA doesn't apply in Europe; GDPR does, and I don't see how that would be in violation since my information was exchanged only between the two parties with my consent, on an encrypted channel.
They would only get into trouble if that info leaked in an identifiable way to unauthorized third parties and caused damages (there are no punitive damages here like in the US). And people here tend to guard their WhatsApp chats pretty well since it's what everyone uses and it also contains their private chats, so in a sense it can even be more secure than the official medical channels, which are just more bureaucratic but offer no actual guarantee of more data security.
> my information was exchanged only between the two parties with my consent, on an encrypted channel
Say WhatsApp is found to have a security hole that has been leaking data to 3rd parties. What may be the fate of dentists / doctors that decided to use it as an "encrypted channel" for medical records? Are doctors / dentists not fat targets for lawsuits? What might the guidance be from their lawsuit insurance policy?
I get that, my point is, why is it absolutely necessary to use the computer system? Why don't they just knock on the door, go grab the medicine and tools, apply it, then fill it into the system later?
I understand they would just postpone whatever can be postponed to save the headache, I don't get the stories about life/health threatening situations.
Have you ever worked a job that requires a high degree of physical world logistics? In times where the primary coordination mechanism is down, any action becomes much slower to implement and often at a direct cost to implementing other actions.
With regard to this case, I don't know any specifics, but I can imagine tools requiring digital calibration, inventories not tracked outside digital systems, certain meds behind digital access control, and emergency response strained to the point where complicated non-emergency procedures would be more risk than benefit.
I have managed IT departments that managed hundreds of locations and thousands of computers running Windows XP and Windows Server 2003, no cloud at all. And I went through several similar outages (similar in impact on our operations, not cause or impact on others). Our first priority was to get the critical computers that operated machinery running - we did that hours (1-2) after the problem started. Then we played around with the servers and network for a few weeks - but critical stuff was operable, albeit with lesser capacity and efficiency.
And we were managing forests and waterways, not hospitals and human lives.
That's all fine, but this time, no one could get those computers back up in the first few hours, since they were stuck in a boot loop. Plus, systems like hospitals had to be running all that time. Plus, at the scale this outage is reported to be - banks, stores, factories, phones, emergency services, CNC machines, networking, aircon - I imagine everyone was confused and trying to figure out if anything works.
I'm happy nothing significant was hit over here in Poland; reading the main HN thread on the outage feels like reading war reports.
If it's stuck in a boot loop, the first thing I do is call the local admins and tell them to take a fresh SSD and a Windows installation USB drive with them. Plug in the new SSD, reinstall the OS and copy the files from the old one. Computer running in less than an hour.
That's literally what we did to restart our forest logging machinery. Are human lives less critical than that?
I can imagine that for something like this procedure, which sounds like an infusion of medication into the brain, the "tools" to perform the procedure are themselves computer based or computer dependent. It might not be as simple as injecting a drug into an IV line.
Note that I am not a doctor and have absolutely no specific knowledge beyond what is in the original article, but I am guessing at potential explanations.
Additionally, the article states that there is some "wiffle [sic] room" around the timing of the infusions. So it may be that the delay is not quite as serious as the title makes it sound.
Presumably they would fix these computers first thing during the night from a backup? If not, is this really about CrowdStrike, and not about a hospital unable to keep their absolutely critical computers backed up and restored in a timely manner?
Again, I understand that restoring a complex net of servers is hard and takes time. But they surely have local hospital IT admins for these absolutely critical computers who are always available on site and can do it individually - it's not like there will be more than a hundred of these at a particular hospital? Hack it a little if you have to, disable the SSO etc - all that can be fixed later.
The unfortunate fact of the matter is that centralizing IT systems around large corporate products, including the on-prem software and any cloud services, necessarily means less local control of what can go wrong and how it can be mitigated, and thus often problems that simply can't be fixed, even by competent on-prem staff. Even when it is possible, it's often highly illegal, and most organizations do a lot to beat risk-aversion into everyone on their staff, and of course I mean aversion to risk of breaking rules or protocols, not risk like "someone dying"
I think it's always a mistake to outsource control of a mission-critical system, but that is exactly what large tech companies have been encouraging every organization that will listen to them to do for decades now
I have trouble accepting that. Even if they had to unplug the computer from the network and disable SSO and antivirus in safe mode, it's possible to get the computer operational. Even if they had to reinstall the OS and the critical software from scratch. There are solutions, the question is - did they even try? If not, why? And is CrowdStrike really to blame if they didn't? I just don't think so.
Who in the org do you expect to have that competency, and do you think hospitals aren't keeping crucial things like credentials or software that gates access to things in the cloud when literally everyone in the world is encouraged to at every turn?
The culture of organizational IT is broken because a lot of powerful companies found it profitable to break it and leave something inadequate in its place
Hospital IT sucks. Look at a news report about a ransomware attack or this and it can easily be a few weeks for them to get back in shape. This one is hopefully easier because reportedly CrowdStrike can sometimes pull an update before it BSODs and most Windows machines auto restart on BSOD, so just leaving things unattended may be enough.
Restore from backup or reimaging fresh often means you need a working backup or image server, which at a lot of these places is also a Windows server and is likely also running the same endpoint protection, and is likely also boot looping.
Restore from zero isn't something any IT wants to do, and many of them aren't prepared to do it either.
Like it or not, hospital care revolves around the electronic medical records systems, and while Kaiser Southern California in the 90s was using amber screens and some sort of mainframe, afaik, almost everyone is on Epic now, which is a Windows application with all the baggage that entails. Even before Epic took over Kaiser, they were running terminal emulators on Windows.
IMHO, it would be better for them to put together a ground up desktop distribution with exactly what they need, but that has user training costs and development costs.
From having seen the infusion process myself, I take it that it requires precision measurements over an extended period of time. This would seem an unreasonable requirement for staff to perform.
Again, from what I've seen, infusions are not just "throw it in an IV bag and wait".
If it requires a computer, why was that operationally critical computer not restored from a backup within hours after the problem started? This has nothing to do with CrowdStrike or other bugs - it could've simply failed hardware wise and the hospital should have been able to replace it immediately.
You have a naive view of how modern operations work, I must say. This shows when you suggest endpoints have backups. We're back to the mainframe/terminal times where all software is running on a web server or other centralized application server, which is also in a boot loop, somewhere else.
Failed hardware is different, but hospitals likely have very few computers just 'lying around'. Especially the highly regulated machines, such as those which are attached to MRIs and the like.
21 CFR Part 11 was the bane of my existence. Software that can be installed and configured in a matter of minutes? That's a six month project, at least. Sure, backups are great, but then you've got a significant process to get it back up and running.
These aren't early-2000s logging operations.
I see you'll never be convinced, but this is how modern operations work. Being a hospital (or other industry with heavy government regulations) makes operations that much worse.
Very few companies, for-profit or otherwise, keep gobs of machinery on hand "just in case". It's expensive, not only the machinery, but the space to store it, maintain it while not in use, replace it when it ages out, and so on. It's also exceedingly rare to need it.
Hospitals also have limited resources in terms of IT staff. It's not an Azure army of operations staff that can rush out to every endpoint and click buttons.
When I was in helpdesk eons ago, I was "responsible" for roughly 300 - 400 endpoints, plus a handful of servers. As were all of the other helldesk techs. If something like this happened, there simply aren't enough hands to go around as fast as everyone would like.