by soneil 703 days ago
> Whoever thought up the great idea to allow auto-update-able kernel modules

What's made this whole thing so "interesting" is that the whole point of these "channel files" was to decouple the risk from updating the kernel driver.

Accepted best practice for this product has been to stagger rollout of the kernel driver, so a pilot group gets the current release, the herd get n-1, and sensitive machines get n-2. The product provides for this, and most sites either use it, or admit they should.

So when your pilot group start bluescreening with "DRIVER OVERRAN STACK BUFFER" (actual example from last year), it's caught (by the customer, still) and triaged before it reaches n-1, let alone n-2 & front page of The Times.

But the whole 'sell' of the product is that they get 0-day definitions. So endpoints running the relatively trusted n-2 release still get the same protection against active threats. n-2 have a stable driver running today's "channel data".
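
To make that split concrete, here's a minimal sketch of the two update paths. It is not the vendor's actual mechanism: the ring names, version labels, and function names are all made up for illustration. The only thing taken from the description above is the shape: the driver lags by ring, the channel data doesn't.

```python
from dataclasses import dataclass

# Hypothetical ring names and lags, mirroring the practice described above:
# the pilot group runs the current driver (n), the herd n-1, sensitive machines n-2.
DRIVER_LAG = {"pilot": 0, "herd": 1, "sensitive": 2}


@dataclass(frozen=True)
class Endpoint:
    name: str
    ring: str


def driver_release(endpoint: Endpoint, releases: list[str]) -> str:
    """Pick the driver release for an endpoint; releases are ordered oldest to newest."""
    lag = DRIVER_LAG[endpoint.ring]
    return releases[-1 - lag]


def channel_content(endpoint: Endpoint, latest_channel: str) -> str:
    """Channel data ignores the ring entirely: every endpoint gets today's content."""
    return latest_channel


releases = ["v1", "v2", "v3"]  # illustrative labels, not real version numbers
fleet = [
    Endpoint("dev-laptop", "pilot"),
    Endpoint("office-pc", "herd"),
    Endpoint("db-server", "sensitive"),
]
for ep in fleet:
    print(ep.name, driver_release(ep, releases), channel_content(ep, "channel-today"))
```

The staggering only exists in `driver_release`. Anything that travels through `channel_content` reaches the pilot group and the n-2 machines at the same moment.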

I'm not clear if Friday's "channel file" is the issue in itself, or whether it triggered a less-explored code path in the kernel driver - but the result is the same. The best practice of staggering the kernel driver releases didn't save us from a logic bomb in the "channel file".

I just think the distinction is interesting because following accepted best practices, vendor recommendations, and conservative deployment recommendations did not protect from this. It's not the customers that were yolo'ing this.

2 comments

It seems like a (possibly obvious?) corollary of the Church-Turing thesis that any sufficiently advanced scripting language for a kernel driver is still a kernel-level deployment and should be treated as such. Which is to say that these "conservative deployment recommendations" don't seem conservative enough given what we know of Turing completeness and how easy it is to break any Turing machine. (I still love that our academia has found an unfixable "0-day jailbreak" in the Universal Turing Machine itself, proving that this root problem is truly deep in computation theory and reproducible at the most abstract levels.)

(The other recent news, that Red Hat has been blaming CrowdStrike for shipping eBPF programs that kernel panic on Linux, adds evidence that any sufficiently advanced scripting language for kernel drivers carries a kernel-driver level of deployment risk.)

This was a very valuable insight. I'm a med student at the moment, so my interest in networking and tech in general is a tad more shallow, but I appreciate your perspective nonetheless!

Additionally, would you mind sharing your thoughts on the following observations? Afaik, similarly to medical devices, we recognize the criticality of software for applications such as ATC or microcontroller-based railway switchyards; for obvious reasons ofc. Alright, but ensuring the availability of barebones emergency response or Hospital IT shouldn’t be far off in terms of criticality, no?

Yet ATC, avionics, rail DMIs/infrastructure and similar go through the effort of building ultra-available, purpose-built systems that are very different from Windows instances running CS kernel tools, even thoughtful ones.

In contrast, said healthcare/emergency-related applications are apparently okay with relying on mission-critical Windows boxes. I hope that info is factual, otherwise mea culpa.

I don't mind healthcare using less elaborate tech for non-critical purposes, the equivalent of the service responsible for providing train delay updates, stuff far away from operating-signals type ops. But if it's mission critical or able to impede critical services, that's really worrying to me.

So straight up I have to admit that this isn't my wheelhouse - I support a bunch of developers who seem to enjoy breaking things. Safety-critical or life-critical just isn't my thing. If breaking stuff is half the fun, you probably shouldn't be in medicine ;)

Say you have one server that houses all your patient data, and 1000 workstations that access it. I think it's safe to assume you'd treat that one server as your "crown jewels". You want it to be triple-redundant, you want it to be on battery, generator, a very conservative lifecycle management, replicas in different fire zones, immutable backups, etc etc.

Your thousand desktops .. meh. This is where you want your endpoint protection, this is where you're worried about data egress, etc. They still need to be controlled because they have access to the patient data. But you're not so worried about resilience. If a workstation goes bang, you just go out and image it.

I'd consider this a fairly typical way to evaluate risk and threat.

"I felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced. I fear something terrible has happened."

So on Friday, those thousand workstations simultaneously turned blue. Our hypothetical threat model so far has treated workstations as disposable, replaceable, but didn't consider workstations in their entirety. And once we lose the entirety, all our "crown jewels" are safe on our triple-redundant servers, but there's no way to access them. And the resulting "stop work" is a risk to any patient who really needed that work done today.
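
As a back-of-envelope version of that gap, here's a rough sketch with invented numbers (the failure probabilities are pure assumptions, not measurements from any real site). It compares the chance of losing every workstation on the same day when failures are independent against the chance when one shared event, such as a bad update, can take them all out at once.

```python
def p_all_down(p_single: float, n: int, p_common: float = 0.0) -> float:
    """Chance that all n workstations are down on the same day.

    p_single: independent per-machine failure probability (hypothetical).
    p_common: probability of one shared event taking out every machine at once.
    """
    # All n machines failing independently on the same day.
    independent_wipeout = p_single ** n
    # Either the shared event happens, or it doesn't and they all fail independently.
    return p_common + (1.0 - p_common) * independent_wipeout


# Hypothetical figures: 1% daily failure per workstation, 1000 workstations.
print(p_all_down(0.01, 1000))         # independent failures only
print(p_all_down(0.01, 1000, 0.001))  # same fleet, plus a shared failure mode
```

Under the independent model, losing the whole fleet is so unlikely it rounds to nothing, which is why "just go out and image it" feels safe. Once a shared failure mode exists, the fleet-wide risk is roughly whatever the risk of that one shared event is, no matter how many workstations you have.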

Now as I said, this isn't my area at all, I'm spit-balling here, but this is how I understand the fallout from this. An analogy is that we put more effort into protecting the president than the man on the street - but if you wake up one morning and the general population has disappeared, the impact is bigger than losing the president.

I greatly appreciate your response; thank you for taking the time.