Hacker News
by hatsunearu 699 days ago
It sounds like Channel files are just basically definition updates in normal antivirus software; it's not actually code, just some stuff on what the software should "look out for".

And it sounds like they shipped some malformed channel file and the software that interprets it can't handle malformed inputs and ate shit. That software happened to be kernel mode, and also marked as boot-critical, so if it falls over, it causes a BSOD and inability to boot.

and it's kind of understandable that channel files might seem safe to update constantly without oversight, but that's just assuming that the software that interprets the channel file isn't a bunch of dogshit code.
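To make "can't handle malformed inputs" concrete, here is a minimal sketch of a parser that validates a file before trusting anything in it. The format (a `CHNL` magic, a record count and a record size) is entirely made up for illustration; nothing in this thread describes the real channel file layout.

```python
import struct

MAGIC = b"CHNL"                   # hypothetical magic bytes, not a real format
HEADER = struct.Struct("<4sHH")   # magic, record count, record size (8 bytes)


class MalformedFile(ValueError):
    """Raised when the input fails validation."""


def parse_records(blob: bytes) -> list[bytes]:
    """Return the fixed-size records in blob, or raise MalformedFile."""
    if len(blob) < HEADER.size:
        raise MalformedFile("truncated header")
    magic, count, size = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise MalformedFile("bad magic")
    if size == 0:
        raise MalformedFile("zero record size")
    expected = HEADER.size + count * size
    if len(blob) != expected:
        # Check the declared lengths against the real length before indexing.
        raise MalformedFile(f"expected {expected} bytes, got {len(blob)}")
    body = blob[HEADER.size:]
    return [body[i * size:(i + 1) * size] for i in range(count)]
```

The point is only that every length the file declares gets checked against what is actually there, and that bad input produces a rejectable error instead of an out-of-bounds access.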


Configuration files should be treated like code and follow the same gradual rollout practices. See also:

https://sre.google/workbook/canarying-releases/

Which starts with "a majority of incidents are triggered by binary or configuration pushes". The stats for config-related failures are one link away at

https://sre.google/workbook/postmortem-analysis/

Where it says 31% of outages in 2010-2017 are caused by "configuration push".
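A gradual rollout for a config push can be sketched roughly as below. This is not any vendor's pipeline; `apply`, `healthy`, the stage fractions and the failure threshold are all assumptions picked for illustration.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    apply: Callable[[str], None],      # hypothetical: push the config to one host
    healthy: Callable[[str], bool],    # hypothetical: did the host survive the push?
    stages: Sequence[float] = (0.01, 0.1, 1.0),
    max_failure_rate: float = 0.0,
) -> tuple[list[str], bool]:
    """Push to a growing fraction of hosts and stop at the first unhealthy stage.

    Returns (hosts that received the push, whether the rollout completed).
    """
    updated: list[str] = []
    for fraction in stages:
        target = max(1, int(len(hosts) * fraction))
        batch = list(hosts[len(updated):target])
        for host in batch:
            apply(host)
        updated.extend(batch)
        failures = sum(1 for host in batch if not healthy(host))
        if batch and failures / len(batch) > max_failure_rate:
            return updated, False      # halt: later stages never see the push
    return updated, True
```

With 200 hosts and the default stages, a push that breaks every machine stops after the first 2 hosts instead of reaching all 200.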

It's not understandable imo. At the very least they should have tests for the loader component that shows it can handle corrupted input. Amateur hour.
Agreed. We all know about a really interesting vector for infecting the kernel now. One that is poorly tested, poorly implemented, and poorly secured.
And though I don't know for sure, I don't think it's safe to say they don't contain "code." It seems to me that they would have to; otherwise novel attacks that weren't caught by one of their existing algorithms could never be detected.

I'm guessing they contain some combination of pattern/regexp type stuff, and interpreted code/scripting with trigger criteria, etc. that all gets loaded into the "engine" that actually runs the threat detection.
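As a guess at what "patterns loaded into an engine" could look like, here is a toy sketch. The rule shape (`name`, `pattern`) and the sample rule are invented for illustration and say nothing about the real file contents.

```python
import re


def load_rules(raw_rules):
    """Compile each rule's pattern up front, dropping rules that are malformed."""
    compiled = []
    for rule in raw_rules:
        try:
            compiled.append((rule["name"], re.compile(rule["pattern"])))
        except (KeyError, TypeError, re.error):
            continue  # a bad rule is skipped instead of taking the engine down
    return compiled


def scan(compiled, text):
    """Return the names of all rules whose pattern matches text."""
    return [name for name, pattern in compiled if pattern.search(text)]


# Hypothetical rule data: the second entry is deliberately broken.
rules = load_rules([
    {"name": "encoded-powershell", "pattern": r"powershell.*-enc"},
    {"name": "broken", "pattern": "("},
])
```

Even in this toy version the data decides what the engine does at runtime, so a bad rule has to be handled somewhere.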

Halting problem is undecidable.

On the scale of "no one bothered to put error handling or validation in" to "a subtle problem exists for this given input", you and I lack the information to make a judgement.

> you and I lack the information to make a judgement.

Think about this a little harder: what do you know about the number of customers affected? We do actually have enough information to make a judgement - bricking millions of critical systems, a very high percentage of their total Windows customer population, tells us that they don’t have progressive rollouts, don’t fail into a safe mode, and that if they do have tests those tests are catastrophically unlike anything their customers run – all they had to do was launch an EC2 instance and see if it kept running.
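For what "fail into a safe mode" could mean in practice, here is a minimal sketch of a boot-failure counter that falls back to the last content version known to boot. The threshold and the state layout are assumptions, and a real implementation would have to persist this state somewhere that survives a crash.

```python
from dataclasses import dataclass

MAX_FAILED_BOOTS = 2  # assumed threshold, chosen for illustration


@dataclass
class BootState:
    active: str            # content version to load on the next boot
    last_known_good: str   # version that most recently completed a boot
    failed_boots: int = 0  # consecutive boots that never completed


def on_boot_start(state: BootState) -> str:
    """Call before loading content; returns the version that should be loaded."""
    if state.failed_boots >= MAX_FAILED_BOOTS and state.active != state.last_known_good:
        state.active = state.last_known_good   # roll back to the safe version
        state.failed_boots = 0
    state.failed_boots += 1                    # assume failure until proven otherwise
    return state.active


def on_boot_complete(state: BootState) -> None:
    """Call once the system is up; marks the active version as good."""
    state.failed_boots = 0
    state.last_known_good = state.active
```

If "v2" crashes the machine twice in a row, the third boot loads "v1" instead of looping forever.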

Not fuzzing a feature that takes user-supplied input, especially for AV, is damning.
I mean, the whole world was impacted. All they had to do was test this change in a lab with several PCs. Clearly this wasn't an edge case nor a subtle problem. This was clearly a lack of testing.
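A bare-bones mutation fuzzer is not much code. The sketch below is a toy, not a substitute for a real fuzzer, and `toy_parser` is a deliberately buggy stand-in written for this example, not anyone's actual parser.

```python
import random


def fuzz(parser, seed_input, allowed=(ValueError,), rounds=1000, seed=0):
    """Randomly mutate seed_input and feed it to parser.

    Returns the first input that raises an exception outside `allowed`,
    or None if no such input was found.
    """
    rng = random.Random(seed)
    for _ in range(rounds):
        data = bytearray(seed_input)
        for _ in range(rng.randint(1, 4)):
            choice = rng.random()
            if choice < 0.5 and data:
                data[rng.randrange(len(data))] = rng.randrange(256)   # flip a byte
            elif choice < 0.8 and data:
                del data[rng.randrange(len(data)):]                   # truncate
            else:
                extra = rng.randint(1, 8)
                data.extend(rng.randrange(256) for _ in range(extra))  # append junk
        try:
            parser(bytes(data))
        except allowed:
            pass                 # a clean rejection is the desired behaviour
        except Exception:
            return bytes(data)   # anything else is a bug in the parser
    return None


def toy_parser(blob):
    """Deliberately buggy: trusts the length byte without a bounds check."""
    if not blob:
        raise ValueError("empty input")
    count = blob[0]
    return [blob[1 + i] for i in range(count)]   # IndexError when count is too large


crash = fuzz(toy_parser, bytes([3, 10, 20, 30]))
```

Here `crash` ends up holding an input that makes `toy_parser` raise `IndexError`, which is the Python-level analogue of reading past the end of a buffer.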
It was a Friday. Devs just wanted to go home for the weekend.
Leave the spin to the PR people. Their customers pay a great deal of money for 24x7 service, and this wasn’t even a code change but a definition update – a process which should be as well defined and tested as McDonald’s making a hamburger. You wouldn’t excuse getting E. coli from your lunch with “the cook just wanted to go home for the weekend”, and this is a much more expensive service.
Yeah, I re-read my comment and it sounds like I am understanding of them.

But no, saying "channel files aren't kernel code" is just hilarious, considering the channel files define how the actual kernel code is supposed to behave, so it might as well be kernel code. Especially when the bad behavior in question is triggered by bad channel files!!

I was reading these two threads:

https://x.com/perpetualmaniac/status/1814376668095754753?s=4...

https://x.com/ananayarora/status/1814269058088304760

The authors explain the coding error and coredump well, but I'm lost: Is the buggy code that they're describing the channel file, or some kernel code that consumes the channel file? Is there a way to tell?

OK, and another question:-) Can tools like Valgrind and ASan pick up the kinds of errors that are described in those two links from my previous post?
Author of the second post here. The first author's stack trace seems to show a fault on csagent.sys which is a bad read on 0x9c. There are some other .sys files loaded up by csagent.sys, and that's where the crash seems to happen, apparently.
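One common way a read ends up at a small address like 0x9c is a null (or otherwise invalid) base pointer plus a field offset. Whether that is what happened here is not established by anything in this thread. The struct layout below is hypothetical, chosen only so that one field sits at offset 0x9c.

```python
import ctypes


class Record(ctypes.Structure):
    # Hypothetical layout: 0x9c bytes of other fields, then the one being read.
    _fields_ = [
        ("padding", ctypes.c_uint8 * 0x9C),
        ("value", ctypes.c_uint32),
    ]


def field_address(base: int, field: str) -> int:
    """Address the CPU would read for `field` given a struct base address."""
    return base + getattr(Record, field).offset


# With a null base pointer the read lands at the field offset itself.
null_read = field_address(0, "value")
```

`hex(null_read)` is `'0x9c'`: code that reads `ptr->value` without checking `ptr` touches that address when `ptr` is null.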

As for detection, Zach mentions that modern tooling could've been used to find this, so I'm assuming Valgrind can find this: https://x.com/Perpetualmaniac/status/1814376690958868979

Hope this helps!

Cheers Ananay!

So if I put this all together:

a) The driver (sensor) csagent.sys includes code that hasn't been checked with a tool like Valgrind or ASan or something and so includes some kind of memory management bug.

b) Since n, n-1 and n-2 versions of the sensor all died equally spectacularly, that bug has been around for at least three versions of csagent.sys.

c) The bug can be triggered by getting csagent.sys to swallow a shitty channel file, and since csagent runs in kernel mode, when it crashes it BSODs the system.

d) Someone at Crowdstrike uploaded a shitty channel file as part of an update process that apparently happens many times a day.

Am I on the right track so far? If so, there are no/inadequate memory management checks in the csagent driver, and either:

1) There were also no checks before the borked channel file was uploaded because of a failure to follow process, or because there was no process, but whatever the case it was an accident.

or

2) Someone uploaded on purpose, not by accident, the borked channel file intending for a nasty outcome (probably not BSOD)

I can't believe that there aren't a million checks and balances in place to stop (1) from happening, but as my grandma used to say, "Don't assume malice where stupidity will do" :-)

> it's not actually code, just some stuff on what the software should "look out for"

If it controls the behavior of a computer, then it's code.

> and it's kind of understandable that channel files might seem safe to update constantly without oversight

Yeah, no, it's not. They pushed an update that crashed the majority of their Windows installed base in a way that couldn't be fixed remotely. It doesn't matter what the update was to. It needed to be tested. There is no way that any deployment pipeline that could fail to catch something that blatant could possibly be "understandable".

... and that kernel mode code shouldn't have been parsing anything with any complexity to begin with. And should have been tested into oblivion, and possibly formally verified.

This is amateur-hour nonsense. Which is what you expect from most of these "Enterprise Cyber Security(TM)" vendors.

... AND the users shouldn't have just gone and shoved that kind of thing into every critical path they could think of.