by throwaway3306a 694 days ago
I get that, my point is, why is it absolutely necessary to use the computer system? Why don't they just knock on the door, go grab the medicine and tools, apply it, then enter it into the system later?

I understand they would just postpone whatever can be postponed to save the headache, I don't get the stories about life/health threatening situations.

5 comments

Have you ever worked a job that requires a high degree of physical world logistics? In times where the primary coordination mechanism is down, any action becomes much slower to implement and often at a direct cost to implementing other actions.

With regard to this case, I don't know any specifics, but I can imagine tools require digital calibration, inventories not tracked outside digital systems, certain meds behind digital access control, and emergency response strained to the point where complicated non-emergency procedures would be more risk than benefit.

I have managed IT departments that managed hundreds of locations and thousands of computers running Windows XP and Windows Server 2003, no cloud at all. And I went through several similar outages (similar in impact on our operations, not cause or impact on others). Our first priority was to get the critical computers that operated machinery running - we did that hours (1-2) after the problem started. Then we played around with the servers and network for a few weeks - but critical stuff was operable, albeit with lesser capacity and efficiency.

And we were managing forests and waterways, not hospitals and human lives.

That's all fine, but this time, no one could get those computers back up in the first few hours, since they were stuck in a boot loop. Plus, systems like hospitals had to be running all that time. Plus, at the scale this outage is reported to be - banks, stores, factories, phones, emergency services, CNC machines, networking, aircon - I imagine everyone was confused and trying to figure out if anything works.

I'm happy nothing significant was hit over here in Poland; reading the main HN thread on the outage feels like reading war reports.

If it's stuck in a boot loop, the first thing I do is call the local admins and tell them to take a fresh SSD and a Windows installation USB drive with them. Plug in the new SSD, reinstall the OS and copy the files from the old one. Computer running in less than an hour.

That's literally what we did to restart our forest logging machinery. Are human lives less critical than that?

You might consider that things have changed in the past 20 years. Also that medicine operates differently than forest logging.
Things haven't changed in IT so much. I am not in ICT management anymore, but I write software for the modern enterprise systems and networks - I'm reasonably up to date.

As for medicine - hence my question, I'd really like to know what's the blocker. So far it seems the blocker is bad IT management, regulation and liability, not that the treatment is impossible to perform.

I can imagine that for something like this procedure, which sounds like an infusion of medication into the brain, the "tools" to perform the procedure are themselves computer based or computer dependent. It might not be as simple as injecting a drug into an IV line.

Note that I am not a doctor and have absolutely no specific knowledge beyond what is in the original article, but I am guessing at potential explanations.

Additionally, the article states that there is some "wiffle [sic] room" around the timing of the infusions. So it may be that the delay is not quite as serious as the title makes it sound.

Presumably they would fix these computers first thing during the night from a backup? If not, is this really about CrowdStrike, and not about a hospital unable to keep their absolutely critical computers backed up and restored in a timely manner?

Again, I understand that restoring a complex net of servers is hard and takes time. But they surely have local hospital IT admins for these absolutely critical computers who are always available on site and can do it individually - it's not like there will be more than a hundred of these at a particular hospital? Hack it a little if you have to, disable the SSO etc - all that can be fixed later.

The unfortunate fact of the matter is that centralizing IT systems around large corporate products, including the on-prem software and any cloud services, necessarily means less local control of what can go wrong and how it can be mitigated, and thus often problems that simply can't be fixed, even by competent on-prem staff. Even when it is possible, it's often highly illegal, and most organizations do a lot to beat risk-aversion into everyone on their staff, and of course I mean aversion to risk of breaking rules or protocols, not risk like "someone dying"

I think it's always a mistake to outsource control of a mission-critical system, but that is exactly what large tech companies have been encouraging every organization that will listen to them to do for decades now

I have trouble accepting that. Even if they had to unplug the computer from the network and disable SSO and antivirus in safe mode, it's possible to get the computer operational. Even if they had to reinstall the OS and the critical software from scratch. There are solutions, the question is - did they even try? If not, why? And is CrowdStrike really to blame if they didn't? I just don't think so.
Who in the org do you expect to have that competency, and do you think hospitals aren't keeping crucial things like credentials or software that gates access to things in the cloud when literally everyone in the world is encouraged to at every turn?

The culture of organizational IT is broken because a lot of powerful companies found it profitable to break it and leave something inadequate in its place

I agree with this sentiment. If you ask me, the entities that come out looking the worst from this CrowdStrike debacle are the companies that bought their service. CrowdStrike made a poorly designed and maintained product. I heard multiple people on Reddit say it's the best of that type of product, but what the hell? Why does it need kernel-level control?

Why did we get here? If you're installing kernel-level software you might as well run a kiosk that only runs presigned code and runs off a read-only system image. And a lot of the machines in question DO APPEAR to be kiosk settings (like hospital data entry terminals).

It's easy to sit back and armchair, I'm sure there will be many cybersecurity experts who would figuratively jump at my throat for suggesting that trusting a vendor to run a rootkit on your computers is a bit incompetent. LOL. :D

I expect the local admins to be able to install a fresh OS not connected to the enterprise network. And I expect them to have physical copies of stuff like disk encryption keys, also backups of OS installations and images, and all critical software. If they don't have that or can't use it during an outage, the problem is incompetent IT management that has no business running a hospital, not CrowdStrike. Something else would take them out sooner or later.

Again, we had all of this for a forest logging operation - is it too much to expect at a hospital?

Absolutely. The risk being managed is the risk to the CEO/CTO's jobs, not the risk to life.
Hospital IT sucks. Look at a news report about a ransomware attack or this and it can easily be a few weeks for them to get back in shape. This one is hopefully easier because reportedly CrowdStrike can sometimes pull an update before it BSODs and most Windows machines auto restart on BSOD, so just leaving things unattended may be enough.

Restore from backup or reimaging fresh often means you need a working backup or image server, which at a lot of these places is also a Windows server and is likely also running the same endpoint protection, and is likely also boot looping.

Restore from zero isn't something any IT wants to do, and many of them aren't prepared to do it either.

Like it or not, hospital care revolves around the electronic medical records systems, and while Kaiser Southern California in the 90s was using amber screens and some sort of mainframe, afaik, almost everyone is on EPIC now, which is a Windows application with all the baggage that entails. Even before EPIC took over Kaiser, they were running terminal emulators on Windows.

IMHO, it would be better for them to put together a ground up desktop distribution with exactly what they need, but that has user training costs and development costs.

From having seen the infusion process myself, I take it that it requires precision measurements over an extended period of time. This would seem an unreasonable requirement for staff to perform.

Again, from what I've seen, infusions are not just "throw it in an IV bag and wait".

If it requires a computer, why was that operationally critical computer not restored from a backup within hours after the problem started? This has nothing to do with CrowdStrike or other bugs - it could've simply failed hardware wise and the hospital should have been able to replace it immediately.
You have a naive view of how modern operations work, I must say. This shows when you suggest endpoints have backups. We're back to the mainframe/terminal times where all software is running on a web server or other centralized application server, which is also in a boot loop, somewhere else.

Failed hardware is different, but hospitals likely have very few computers just 'lying around'. Especially the highly regulated machines, such as those which are attached to MRIs and the like.

21 CFR Part 11 was the bane of my existence. Software that can be installed and configured in a matter of minutes? That's a six month project, at least. Sure, backups are great, but then you've got a significant process to get it back up and running.

These aren't early-2000 logging operations.

I see you'll never be convinced, but this is how modern operations work. Being a hospital (or other industry with heavy government regulations) makes operations that much worse.

You misunderstood me, I am easily convinced that this is the case - what I don't get is how they could let it be the case.
Very few companies, for-profit or otherwise, keep gobs of machinery on hand "just in case". It's expensive, not only the machinery, but the space to store it, maintain it while not in use, replace it when it ages out, and so on. It's also exceedingly rare to need it.

Hospitals also have limited resources in terms of IT staff. It's not an Azure army of operations staff that can rush out to every endpoint and click buttons.

When I was in helpdesk eons ago, I was "responsible" for roughly 300 - 400 endpoints, plus a handful of servers. As were all of the other helldesk techs. If something like this happened, there's simply not enough hands to go around as fast as everyone would like.

What I meant when I said reinstall the PCs was to reinstall the critical computers necessary for operation of medical machinery to make basic and still mostly manual/paper based operation possible, not every computer they have there. I really don't think they have hundreds of computers necessary for operations of MRIs and other machines.
Hotels have difficulty with paper and pen bookings when their computers are down. You expect a modern hospital to function in those circumstances?
the hospital better function.

what you're saying is, if the less important service fails, of course the more important one will fail too.

Yes
It's because these computers are a means of corporate control. Policies and checks and procedures and whatever are all delivered through them.

It's preferable, from the corporate perspective, to have everything fail temporarily than to relinquish this level of workforce management.

If this is hard to imagine, just think of a Lyft driver from the perspective of Lyft Inc.