by RCitronsBroker 697 days ago
You’re not wrong about the end result, but the breakdown of systems this complex goes deeper than placing the blame on some CrowdStrike employee.

Whoever thought up the great idea to allow auto-update-able kernel modules for something as mission critical as emergency response or healthcare deserves just as much blame. I’ve worked in healthcare for my whole career; this is madness. Not that their process is without flaw, but can we remind ourselves of how stringently we assess medical devices? I cannot imagine it’s controversial to say that emergency response equipment is every bit as critical as an insulin pump. If they fail, someone dies.


> Whoever thought up the great idea to allow auto-update-able kernel modules

What's made this whole thing so "interesting" is that the whole point of these "channel files" was to decouple the risk from updating the kernel driver.

Accepted best practice for this product has been to stagger rollout of the kernel driver, so a pilot group gets the current release, the herd get n-1, and sensitive machines get n-2. The product provides for this, and most sites either use it, or admit they should.

So when your pilot group start bluescreening with "DRIVER OVERRAN STACK BUFFER" (actual example from last year), it's caught (by the customer, still) and triaged before it reaches n-1, let alone n-2 & front page of The Times.

But the whole 'sell' of the product is that they get 0-day definitions. So endpoints running the relatively trusted n-2 release still get the same protection against active threats. n-2 have a stable driver running today's "channel data".

I'm not clear if Friday's "channel file" is the issue in itself, or whether it triggered a less-explored code path in the kernel driver - but the result is the same. The best practice of staggering the kernel driver releases didn't save us from a logic bomb in the "channel file".
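To make the distinction concrete, here is a minimal sketch of the n / n-1 / n-2 scheme described above. The ring names, release labels and function names are all made up for illustration; this is not the vendor's API, just a toy model of a driver that is pinned per ring next to content that is not.

```python
# Hypothetical release labels, oldest to newest; the last entry is "n".
DRIVER_RELEASES = ["driver-1", "driver-2", "driver-3"]

# Hypothetical rings and how far behind "n" each one is pinned.
RING_OFFSET = {"pilot": 0, "herd": 1, "sensitive": 2}


def driver_for(ring: str) -> str:
    """Return the driver release a ring is pinned to (n minus its offset)."""
    return DRIVER_RELEASES[-1 - RING_OFFSET[ring]]


def content_for(ring: str, latest_content: str) -> str:
    """Content updates ignore the ring: every endpoint gets the latest."""
    return latest_content


def exposed_rings(bad_driver=None, bad_content=None, latest_content="content-100"):
    """List the rings that would load a faulty driver or faulty content."""
    hit = []
    for ring in RING_OFFSET:
        loads_bad_driver = driver_for(ring) == bad_driver
        loads_bad_content = content_for(ring, latest_content) == bad_content
        if loads_bad_driver or loads_bad_content:
            hit.append(ring)
    return hit


print(exposed_rings(bad_driver="driver-3"))      # only the pilot ring
print(exposed_rings(bad_content="content-100"))  # every ring at once
```

In this toy model a bad driver stops at the pilot ring, while bad content reaches all three rings in the same push, which is the gap the staggering never covered.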

I just think the distinction is interesting because following accepted best practices, vendor recommendations, and conservative deployment recommendations did not protect from this. It's not the customers that were yolo'ing this.

It seems like a (possibly obvious?) variation of the Church-Turing thesis that any sufficiently advanced scripting language for a kernel driver is still a kernel level deployment and should be treated as such. Which is to say that these "conservative deployment recommendations" don't seem conservative enough given what we know of Turing Completeness and how easy it is to break any Turing machine. (I still love that our academia has found an unfixable "0-Day jailbreak" in the Universal Turing Machine itself, proving that this root problem is truly deep in computation theories and reproducible at the most abstract levels.)

(The other recent news, that Red Hat has been blaming CrowdStrike for sending eBPF files that kernel panic on Linux, also contributes evidence that any sufficiently advanced scripting language for kernel drivers carries a kernel driver-level of deployment risk.)

This was a very valuable insight. I’m a med student at the moment and my interest in networking and tech in general is a tad more shallow, but I appreciate your perspective nonetheless!

Additionally, would you mind sharing your thoughts on the following observations? Afaik, similarly to medical devices, we recognize the criticality of software for applications such as ATC or microcontroller-based railway switchyards; for obvious reasons ofc. Alright, but ensuring the availability of barebones emergency response or Hospital IT shouldn’t be far off in terms of criticality, no?

Yet, ATC, avionics, rail DMIs/infrastructure and similar go through the effort of building ultra-available, purpose built systems that are very different from Windows instances running CS kernel tools, even thoughtful ones.

In contrast, said healthcare/emergency related applications are seemingly okay with relying on mission-critical Windows boxes. I hope that info is factual, otherwise mea culpa.

I don’t mind healthcare using less elaborate tech for non-critical purposes, the equivalent of the service responsible for providing train delay updates, stuff far away from operating signals type ops. But if it’s mission critical or able to impede critical services, that’s really worrying to me.

So straight up I have to admit that this isn't my wheelhouse - I support a bunch of developers who seem to enjoy breaking things. Safety-critical or life-critical just isn't my thing. If breaking stuff is half the fun, you probably shouldn't be in medicine ;)

Say you have one server that houses all your patient data, and 1000 workstations that access it. I think it's safe to assume you'd treat that one server as your "crown jewels". You want it to be triple-redundant, you want it to be on battery, generator, a very conservative lifecycle management, replicas in different fire zones, immutable backups, etc etc.

Your thousand desktops .. meh. This is where you want your endpoint protection, this is where you're worried about data egress, etc. They still need to be controlled because they have access to the patient data. But you're not so worried about resilience. If a workstation goes bang, you just go out and image it.

I'd consider this a fairly typical way to evaluate risk and threat.

"I felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened."

So on Friday, those thousand workstations simultaneously turned blue. Our hypothetical threat model so far has treated workstations as disposable, replaceable, but didn't consider workstations in their entirety. And once we lose the entirety, all our "crown jewels" are safe on our triple-redundant servers, but there's no way to access them. And the resulting "stop work" is a risk to any patient who really needed that work done today.
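A back-of-the-envelope way to see why "disposable" stops working at fleet scale. Every number below is invented for illustration (a 1% independent failure rate per workstation, a 0.1% chance of a fleet-wide event, and a need for 900 of the 1000 machines), and the function names are hypothetical.

```python
from math import comb


def p_at_least_k_up(n: int, k: int, p_fail: float) -> float:
    """Probability that at least k of n workstations are up,
    assuming each one fails independently with probability p_fail."""
    p_up = 1.0 - p_fail
    return sum(comb(n, i) * p_up**i * p_fail ** (n - i) for i in range(k, n + 1))


def p_fleet_usable(n: int, k: int, p_fail: float, p_common: float) -> float:
    """Same question, but a common-mode event (one bad update hitting
    every machine at once) takes the whole fleet down with probability p_common."""
    return (1.0 - p_common) * p_at_least_k_up(n, k, p_fail)


independent_only = p_at_least_k_up(1000, 900, 0.01)
with_common_mode = p_fleet_usable(1000, 900, 0.01, 0.001)
print(independent_only, with_common_mode)
```

Under these made-up numbers the independent failures barely register, and almost all of the remaining risk comes from the one event that hits every workstation at the same time.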

Now as I said, this isn't my area at all, I'm spit-balling here, but this is how I understand the fallout from this. An analogy is that we put more effort into protecting the president than the man on the street - but if you wake up one morning and the general population has disappeared, the impact is bigger than losing the president.

I greatly appreciate your response; thank you for taking the time.

But at the same time, auditing every update to an assurance level beyond ‘it didn’t bsod in test’ is incredibly hard.

I don’t disagree with anything you’ve said, but I’d be very interested in solving the problem of actually auditing constant updates from vendors.

Even a rudimentary “delay autoupdate by two weeks” would have saved lives here. Let everyone else update first.

Automated CI/CD - many of us already do this hundreds of times a day. If you’re an emergency call centre, join a consortium of similar orgs and standardise tech and do it properly.

Defer updates. Most things can wait 8-12 hours. Even more can wait 3 weeks (did this for all but security-critical npm package updates in one place).
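A deferral rule like this can be very small. Below is a rough sketch, assuming three made-up update classes and windows loosely based on the figures above (12 hours for routine updates, 3 weeks for dependency bumps, no delay for security-critical ones). The class names and the `is_installable` helper are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical update classes and how long each one waits after publication.
DEFERRAL = {
    "security-critical": timedelta(0),
    "routine": timedelta(hours=12),
    "dependency": timedelta(weeks=3),
}


def is_installable(update_class: str, published_at: datetime, now: datetime) -> bool:
    """True once the update has aged past the deferral window for its class."""
    return now - published_at >= DEFERRAL[update_class]


published = datetime(2024, 1, 1, 9, 0)
print(is_installable("routine", published, published + timedelta(hours=6)))
print(is_installable("routine", published, published + timedelta(hours=12)))
print(is_installable("security-critical", published, published))
```

The point of the window is only to let somebody else hit the bad release first; it does nothing for an update that fails on day 22.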

Demand legal changes to ensure fair liability for failure to undertake basic measures by service providers for paid software and services. Demand proper liability for C-suites not ensuring that actual risk management is in place instead of stupid box-ticking.

Design better software. Seriously, the amount of half-baked stuff that costs so much is incredible. It doesn’t take longer and it doesn’t cost more to do things right; the only change is that management needs to be engaged with outcomes and have skin in the game. Execs should run the risk of going to jail for egregious failures.

Staged releases. Don't cripple all your systems in one go. Keep hot backups that you only update once the main system has survived the update.
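Roughly what that looks like as a loop, as a sketch only: `apply_update` and `is_healthy` are placeholder callbacks you would have to supply, and the host names are invented. The hot backup sits in the last stage so it is only touched after everything before it has survived.

```python
def staged_rollout(stages, apply_update, is_healthy):
    """Update one stage at a time and stop at the first unhealthy stage.

    Returns (hosts_updated, completed). Later stages, including the hot
    backup in the final stage, are left untouched if an earlier one fails.
    """
    updated = []
    for stage in stages:
        for host in stage:
            apply_update(host)
        updated.extend(stage)
        if not all(is_healthy(host) for host in stage):
            return updated, False
    return updated, True


# Hypothetical fleet: canary first, main systems next, hot backup last.
stages = [["canary-1"], ["main-1", "main-2"], ["hot-backup-1"]]

touched = []
result = staged_rollout(stages, touched.append, lambda host: False)
print(result)  # a bad update stops after the canary stage
```

With a bad update the blast radius is the canary stage, and the main systems and the backup never see it.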