by kayo_20211030 703 days ago
This isn't right. If I need a system to run with a piece of code, then it shouldn't run at all if that piece of code is broken. Ignoring the failure is perverse. Let's say that the driver code ensures that some medical machine has safety locks (safeguards) in place to make sure that piece of equipment won't fry you to a crisp; I'd prefer that the whole thing not run at all rather than blithely operate with the safeguards disabled. It's turtles all the way down.

I think the premise is false? It's up to the eBPF implementor what to do in the case of invalid input; the kernel could choose to perform a controlled shutdown in that case. (I have no idea what e.g. Linux actually does here, but one could imagine worlds where the action it takes on invalid input is configurable.)

Also your statement is sometimes not true, although I certainly sympathise in the mainline case. In some contexts you really do need to keep on trucking. The first example to spring to mind is "the guidance computers on an automated Mars lander"; the round-trip to Earth is simply too long to defer responsibility in that case. If you shut down then you will crash, but if you do your best from a corrupted state then you merely probably crash, which is presumably better.

> I have no idea what e.g. Linux actually does here

If you attempt to load an eBPF program that the verifier rejects, the syscall to load it fails with EINVAL or E2BIG. What your user-space program then does is up to you, of course.
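To make the "up to you" part concrete, here is a minimal sketch of a user-space caller choosing between failing closed and failing open when the load is rejected. `fake_load`, `start`, and the policy names are hypothetical stand-ins, not a real eBPF API; the errno values are simply the ones mentioned above.

```python
import errno

FAIL_CLOSED = "fail_closed"  # refuse to run without the program
FAIL_OPEN = "fail_open"      # keep running in a degraded mode


def fake_load(program: bytes) -> int:
    """Hypothetical stand-in for a wrapper around the load syscall."""
    if len(program) > 16:
        raise OSError(errno.E2BIG, "program too large")
    if not program.startswith(b"OK"):
        raise OSError(errno.EINVAL, "program rejected")
    return 3  # pretend file descriptor


def start(program: bytes, policy: str, load=fake_load) -> str:
    """Try to load the program and apply the caller's policy on rejection."""
    try:
        load(program)
    except OSError as exc:
        if exc.errno in (errno.EINVAL, errno.E2BIG):
            # The load was refused; the rest of the system is still healthy,
            # so what happens next is purely a policy decision.
            if policy == FAIL_CLOSED:
                return "refuse_to_start"
            return "run_degraded"
        raise  # anything else is unexpected, so let it propagate
    return "run_with_program"
```

The medical-machine case would pick `FAIL_CLOSED`; the Mars-lander case would pick `FAIL_OPEN`. Either way the decision lives in the caller, not in a crash.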

The medical machine software should just refuse to run with an error message if a critical driver was not loaded. Bricking the OS causes far more trouble: an IT technician now has to fix the machine by hand, where otherwise the fix would just be an update to the faulty driver... Also, does your car refuse to start if you are missing water for the wiper?
Water for the wiper is a userland feature.

A 3rd party hooking into the kernel is the 3rd party's responsibility. It is like equipping your car with LPG: THAT hooks into the engine (kernel). And when I had a faulty gas pressure sensor, my car actually halted (BSOD, if you will) instead of automatically failing over to gasoline as it is designed to do.

You can argue that the car had no means to continue execution while the kernel does; however, an invalid kernel state can cause more corruption down the road. Or, as the parent points out, deliver lethal doses of something.

Initially I was inclined to disagree ("these things should always fail safe"); however, with more and more stuff being pushed into the kernel, it's hard to say that you're wrong or exactly where a line needs to be drawn between "minimally functional system" and "dangerously out of control system".

I think until we discover a technology that forces commercial software vendors to employ functioning QA departments none of this will really solve anything.

I agree that some system components should be treated as critical no matter what, but the software at issue in this case (Falcon Sensor or Antivirus more generally) is precautionary and only best effort anyways. I would wager the vast majority of the orgs affected on Friday would have preferred the marginally increased risk of a malware attack or unauthorized use over a 24-hour period to the total IT collapse they experienced. Further, there's no reason the bug HAD to cause a BSOD; it's possible the systems could have kept on trucking, but with an undefined state and limitless consequences. At least with eBPF you get to detect a subset of possible errors and make a risk management decision based on the result.
I'm with you. What's critical, and what's not? Is it a big thing, or not a big thing? Is this particular machine more critical than the one over there? Security systems need to be at the lowest level, or else some shifty bastard will find a path around them. If it's at the lowest level, the downside of a failure is catastrophic, as we experienced last Friday. The carnage here is ultimately on CrowdStrike. The testing must have been slapdash at best, and missing at worst. eBPF changes nothing. The question is: should we fail, or carry on? eBPF doesn't help with that decision, it only determines the outcome from a system perspective. Any decision is a value judgement; it might be right or wrong, and its outcome either benign or deadly. Choices!
I like how Unison works for this reason. You call functions by cryptographic hash, so you have some assurance that you're calling the same function you called yesterday.

Updates would require the caller to call different functions which means putting the responsibility in the hands of the caller, where it should be, instead of on whoever has a side channel to tamper with the kernel.

You end up with the work-perfectly-or-not-at-all behavior that you're after because if the function that goes with the indicated hash is not present, you can't call it, and if it is present you can't call it in any way besides how it was intended.
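A toy illustration of the call-by-hash idea, in Python rather than Unison. This is not how Unison itself is implemented; `Registry`, `publish`, and `call` are hypothetical names, and hashing a source string with SHA-256 is an assumption made to keep the sketch short.

```python
import hashlib


class Registry:
    """Toy store of functions addressed by a hash of their source text."""

    def __init__(self):
        self._by_hash = {}

    def publish(self, source: str, func) -> str:
        # The digest is the only name callers ever use for this function.
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        self._by_hash[digest] = func
        return digest

    def call(self, digest: str, *args):
        # An unknown hash cannot be called at all: there is no
        # "whatever is currently installed under this name" fallback.
        if digest not in self._by_hash:
            raise LookupError("no function with hash " + digest)
        return self._by_hash[digest](*args)


registry = Registry()
v1 = registry.publish("def double(x): return x * 2", lambda x: x * 2)
v2 = registry.publish("def double(x): return x + x + 0", lambda x: x + x + 0)
```

A caller that pinned `v1` keeps getting `v1` even after `v2` is published; moving to the new version is an explicit change on the caller's side.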

The system clearly already behaves that way (i.e. ignores failure) - after all, the fix was to simply delete the offending file. If that's an option, then the loader can do that too. It could even be smarter, such as "fall back to the previous version".
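A minimal sketch of that "fall back to the previous version" loader. The JSON format, the `rules` key, and the function names are all hypothetical, chosen only to make the example runnable; nothing here reflects the real product's file format or loader.

```python
import json


def validate(blob: bytes) -> dict:
    """Parse a content file; raise ValueError if it is malformed."""
    # Decoding and JSON parsing errors are both subclasses of ValueError.
    data = json.loads(blob.decode("utf-8"))
    if not isinstance(data, dict) or "rules" not in data:
        raise ValueError("missing 'rules'")
    return data


def load_with_fallback(versions):
    """Return (name, content) for the newest version that validates.

    `versions` is a list of (name, blob) pairs, newest first.
    Returns (None, None) if every version is malformed.
    """
    for name, blob in versions:
        try:
            return name, validate(blob)
        except ValueError:
            continue  # skip the bad file instead of crashing on it
    return None, None


versions = [
    ("v2", b"\xff\xfe not valid at all"),
    ("v1", b'{"rules": ["block-known-bad"]}'),
]
chosen, content = load_with_fallback(versions)
```

Here the malformed `v2` is skipped and `v1` is used. What to do when `load_with_fallback` returns `(None, None)` is the same fail-open or fail-closed question as before.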

Furthermore, the reaction to a malformed state need not be "ignore". It could disable restricted user login; or turn off the screen.

If the worry is that this is open to abuse by malware, well, if the malware can already rewrite the on-disk files for the AV, I wonder whether it's really a good idea to trust the system itself to be able to deal with that. It'd probably be safer to just report that up the security foodchain, and potentially let some external system take measures such as disabling or restricting network access. Better yet, such measures don't even require the same capabilities to intervene in the system, merely to observe - which makes the AV system less likely to serve as a malware vector itself or to cause bugs like this.

> Ignoring the failure is perverse.

If the failed system is a security module, I think that's absolutely correct. If the system runs, without the security module, well, that's like forgetting to pack condoms on Shore Leave. You'll likely be bringing something back to the ship with you.

Someone needs to be testing the module, and the enclosing system, to make sure it doesn't cause problems.

I suspect that it got a great deal of automated unit testing, but maybe not so much fuzz and monkey (especially "Chaos Monkey"-style) testing.

It's a fuzzy, monkey-filled world out there...

Interesting analogy, but yes. If the module *is* necessary, well, it's necessary and nothing should work without it. Testing must have been a mess here.