by tail_exchange 699 days ago
Can someone who actually understands what CrowdStrike does explain to me why on earth they don't have some kind of gradual rollout for changes? It seems like their updates go out everywhere all at once, and this sounds absolutely insane for a company at this scale.

It sounds like Channel files are just basically definition updates in normal antivirus software; it's not actually code, just some stuff on what the software should "look out for".

And it sounds like they shipped some malformed channel file and the software that interprets it can't handle malformed inputs and ate shit. That software happened to be kernel mode, and also marked as boot-critical, so if it falls over, it causes a BSOD and inability to boot.

and it's kind of understandable that channel files might seem safe to update constantly without oversight, but that's just assuming that the file that interprets the channel file isn't a bunch of dogshit code.

Configuration files should be treated like code and follow the same gradual rollout practices. See also:

https://sre.google/workbook/canarying-releases/

Which starts with "a majority of incidents are triggered by binary or configuration pushes". The stats for config-related failures are one link away at

https://sre.google/workbook/postmortem-analysis/

Where it says 31% of outages in 2010-2017 are caused by "configuration push".

It's not understandable imo. At the very least they should have tests for the loader component that shows it can handle corrupted input. Amateur hour.
Agreed. We all know about a really interesting vector for infecting the kernel now. One that is poorly tested, poorly implemented, and poorly secured.
And though I don't know, I'm guessing it's not a certainty to say they don't contain "code." It would seem to me that they would have to, otherwise novel attacks that weren't caught by one of their existing algorithms could never be detected.

I'm guessing they contain some combination of pattern/regexp type stuff, and interpreted code/scripting with trigger criteria, etc. that all gets loaded into the "engine" that actually runs the threat detection.

Halting problem is undecidable.

On the scale of "no one bothered to put error handling or validation in" to "a subtle problem exists for this given input"; you and I lack the information to make a judgement.

> you and I lack the information to make a judgement.

Think about this a little harder: what do you know about the number of customers affected? We do actually have enough information to make a judgement - bricking millions of critical systems, a very high percentage of their total Windows customer population, tells us that they don’t have progressive rollouts, don’t fail into a safe mode, and that if they do have tests those tests are catastrophically unlike anything their customers run – all they had to do was launch an EC2 instance and see if it kept running.

Not fuzzing a feature that consumes externally supplied input, especially for AV, is damning.
I mean, the whole world was impacted. All they had to do was test this change in a lab with several PCs. Clearly this wasn't an edge case or a subtle problem. This was clearly a lack of testing.
It was a Friday. Devs just wanted to go home for the weekend.
Leave the spin to the PR people. Their customers pay a great deal of money for 24x7 service, and this wasn’t even a code change but a definition update – a process which should be as well defined and tested as McDonald’s making a hamburger. You wouldn’t excuse getting E. coli from your lunch with “the cook just wanted to go home for the weekend”, and this is a much more expensive service.
Yeah, I re-read my comment and it sounds like I am understanding of them.

But no, saying "channel files aren't kernel code" is just hilarious, considering the channel files define how the actual kernel code is supposed to behave, so it might as well be kernel code. Especially when the bad behavior in question is triggered by bad channel files!!

I was reading these two threads:

https://x.com/perpetualmaniac/status/1814376668095754753?s=4...

https://x.com/ananayarora/status/1814269058088304760

The authors explain the coding error and coredump well, but I'm lost: Is the buggy code that they're describing the channel file, or some kernel code that consumes the channel file? Is there a way to tell?

OK, and another question:-) Can tools like Valgrind and ASan pick up the kinds of errors that are described in those two links from my previous post?
Author of the second post here. The first author's stack trace seems to show a fault on csagent.sys which is a bad read on 0x9c. There are some other .sys files loaded up by csagent.sys, and that's where the crash seems to happen, apparently.

As for detection, Zach mentions that modern tooling could've been used to find this, so I'm assuming Valgrind can find this: https://x.com/Perpetualmaniac/status/1814376690958868979

Hope this helps!

Cheers Ananay!

So if I put this all together:

a) The driver (sensor) csagent.sys includes code that hasn't been checked with a tool like Valgrind or ASan or something and so includes some kind of memory management bug.

b) Since n, n-1 and n-2 versions of the sensor all died equally spectacularly, that bug has been around for at least three versions of csagent.sys.

c) The bug can be triggered by getting the csagent.sys to swallow a shitty channel file and since csagent runs in kernel mode, when it crashes it BSOD's the system.

d) Someone at Crowdstrike uploaded a shitty channel file as part of an update process that apparently happens many times a day.

Am I on the right track so far? If so, there's no/inadequate memory management checks in the csagent driver, and either:

1)There were also no checks before the borked channel file was uploaded because of a failure to follow process, or because there was no process, but whatever the case it was an accident.

or

2) Someone uploaded on purpose, not by accident, the borked channel file intending for a nasty outcome (probably not BSOD)

I can't believe that there are not a million checks and balances in place to stop (1) from happening, but as my grandma used to say, "Don't assume malice where stupidity will do" :-)

> it's not actually code, just some stuff on what the software should "look out for"

If it controls the behavior of a computer, then it's code.

> and it's kind of understandable that channel files might seem safe to update constantly without oversight

Yeah, no, it's not. They pushed an update that crashed the majority of their Windows installed base in a way that couldn't be fixed remotely. It doesn't matter what the update was to. It needed to be tested. There is no way that any deployment pipeline that could fail to catch something that blatant could possibly be "understandable".

... and that kernel mode code shouldn't have been parsing anything with any complexity to begin with. And should have been tested into oblivion, and possibly formally verified.

This is amateur-hour nonsense. Which is what you expect from most of these "Enterprise Cyber Security(TM)" vendors.

... AND the users shouldn't have just gone and shoved that kind of thing into every critical path they could think of.

This "channel file" is equivalent to an AV signature file. Crowdstrike is the company, the product here is "Falcon" which does behavioral monitoring of processes both on the device and using logs collected from the device in the cloud.

I can see your perspective, but you should consider this: They protect these many companies, industries and even countries at such a global scale and you haven't even heard of them in the last 15 years of their operation until this one outage.

You can't take days testing gradual roll outs for this type of content, because that's how long customers are left unprotected by that content. Although the root cause is on the channel files, I feel like the driver that processes them should have been able to handle the "logic bug" in question so we'll find out more over time I guess.

For example, with windows defender which runs on virtually all windows systems, the signature updates on billions of devices are pushed immediately (with the exception of enterprise systems, but even then there is usually not much testing on signature files themselves, if at all). As far as the devops process Crowdstrike uses to test the channel files, I think it's best to leave commentary on that to actual insiders but these updates happen several times a day sometimes and get pushed to every Crowdstrike customer.

> They protect these many companies, industries and even countries at such a global scale and you haven't even heard of them in the last 15 years of their operation

I certainly don't want to know (through disaster news) about the construction company that built the bridge I drive through everyday, not for another 15 years, not ever!

This kind of software simply should not fail, with such a massive install base in so many sensitive industries. We're better than that, the software industry is starting to mature and there are simple and widely-known procedures that could have been used to prevent it.

I have no idea how CrowdStrike stock has only dropped 10% to the values of 2 months ago. Actually, if the financial troubles you get into are only these, take back what I said, software should be failing a lot (why spend money on robustness when you don't lose money on bugs?)

working in software, you should know how insanely complex software is, even google, amazon, microsoft, cloudflare and such have outages. mistakes happen because humans are involved. it is the nature and risk of depending on complex systems. bridges by comparison are not that complicated.

I actually expected their stock to drop a lot more than this, but goes to show you how valuable they are. investors know that any dip is only temporary because no one is getting rid of crowdstrike.

Think of the security landscape as early 90's new york city at night and crowdstrike as the big bulky guy with lots of guns who protects you for a fee, if he makes a mistake and hurts you, you will be mad but in the end your need for protection does not suddenly go away and it was a one time mistake.

In which case "Are you awake and sane?" would be a sensible reality check before heading out.

You're trying to hand-wave away the inexcusable. The outage is a symptom. The problem is the lack of even the most basic testing.

Clearly these files are sent out without even a minimal sanity check. That is a problem, and it's not something that can be hand-waved away.

In the 3-4 decades of the security industry, testing signature files to see if they trigger a corner case system crash has never been practiced. You and others are proclaiming yourselves to be experts in an area of technology you have no experience in. This was not a software update!!
Then that's 3-4 decades of massive incompetence, isn't it? "Testing before pushing an update" is basic engineering, they have a huge scale so huge responsibility, and they have the money to perform the tests and hire people who aren't entirely stupid. That's gross malpractice.
>> You can't take days testing gradual roll outs for this type of content, because that's how long customers are left unprotected by that content.

If you can't take days to do it then do a gradual rollout in hours. It's not a high bar.

they reverted it after about one hour. but sure, they didn't need to target all customers all at once, that's a good point.
> They protect these many companies, industries and even countries at such a global scale and you haven't even heard of them in the last 15 years of their operation until this one outage.

They certainly run their software on those many customers' systems, but based on my experience with them, "protect" isn't a descriptor I'm willing to grant them.

We don't have the counter-factual where Crowdstrike doesn't exist, but I'm not convinced that they've been a net economic or security benefit to the world over the span of their existence.

Yes, we do have a counter factual: they catch actual APTs, they investigated the DNC hack in the 2016 elections, and they stopped many more attacks. You are utterly clueless in this area to make a comment like that, honestly. I don't mean that as an insult, but you are talking about a world they don't exist in as if every company has them. Most of their customers get them after getting pwned and learning their lesson the hard way. And availability isn't the only security property their customers desire; keeping information out of threat actors' hands and preventing them from tampering with things is also desirable. I really hope you understand that in your hypothetical world without crowdstrike, threat actors still exist.
> Most of their customers get them after getting pwned and learning their lesson the hard way.

Sure, that applies to my company, but the counter-factual isn't "nothing is done and we keep getting pwned", the counter-factual is that instead of the resources spent on crowdstrike and their various problems (which have been regular since we adopted them, the recent mess was just the biggest), those resources are spent on improving security infrastructure without crowdstrike.

Another commenter said that this change was a malformed configuration that crashed the application. If this is the case, you wouldn't need days to see this problem manifest, but only a few minutes. If they had rolled it out to 1% of their customers and waited for a couple hours before releasing it everywhere, they probably would have caught it.
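
For what it's worth, the 1%-first idea is not much code. A rough Python sketch follows; everything in it is hypothetical (the `push` and `is_healthy` callbacks, the stage fractions, the failure threshold) and says nothing about how CrowdStrike's pipeline actually works. It only shows the shape of halting a rollout when the first cohort starts failing.

```python
import random
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    push: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.10, 1.0),
    max_failure_rate: float = 0.02,
    seed: int = 0,
) -> dict:
    """Push an update cohort by cohort, halting if a cohort looks unhealthy.

    `stages` are cumulative fractions of the fleet. All names and thresholds
    here are illustrative assumptions, not a real vendor API.
    """
    order = list(hosts)
    random.Random(seed).shuffle(order)  # avoid always canarying the same hosts
    done = 0
    for fraction in stages:
        target = min(len(order), max(1, int(len(order) * fraction)))
        cohort = order[done:target]
        for host in cohort:
            push(host)
        # A real pipeline would wait here (minutes to hours) before checking.
        failures = sum(1 for host in cohort if not is_healthy(host))
        done = max(done, target)
        if cohort and failures / len(cohort) > max_failure_rate:
            return {"halted": True, "updated": done, "failed_stage": fraction}
    return {"halted": False, "updated": done, "failed_stage": None}
```

With a fleet of 1000 hosts and a health check that always fails, this stops after the first 10 machines instead of all 1000.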
A couple of hours is a long time in the world of automated attacks
It only takes a couple of minutes if you first update your on-site set of LIVE systems sitting there to detect a problem.

If problem encountered, don't send it out to everyone else.

A couple of hours is absolutely nothing compared to the massive worldwide effort that many people have to put in to fix the problem of a company’s shitty product and release practices.

This is inexcusable, point blank. “A couple of hours is a long time” is not a valid excuse when the alternative, as clearly evidenced, is millions of computers and critical systems simultaneously failing hard.

This might have been different if it was a small subset of computers, but this clearly could have been caught in minutes with any sort of sensible testing or canary rollout practices.

I'm guessing they didn't expect content updates to cause such an impact, they've been doing this for 15 years, it is that uncommon. a couple of hours in their world is a long time because their concern is protecting customers as soon as possible. I'm sure they'll do all kinds of tests going forward and be transparent about it. Keep in mind how easy it is for you or I to come to conclusions without understanding or knowing the context they operate in, maybe it will be more clear soon enough.
Then they should make their testing pipelines even faster, and make sure that they can go from detecting a new threat->tested definition file as quickly as possible. You genuinely cannot skimp on testing in this case. It's inherent to the update, threat protection and not breaking their customers' systems should be non-negotiable for a release. That means testing before deploying. If they can't do it fast enough, their product is broken.
An automated attack would struggle to reach the level of destruction that this failure had due to the scale of Crowdstrike deployment and the direct update vector and kernel mode failure. Even with the most critical type of remote vulnerability it would be difficult to achieve anything approaching this level of damage, and for all we know (and by all probabilities) this update was addressing a much less severe vulnerability.
Not as long as the weeks it's going to take to undo this.
they are dumb enough to process their "channel files" in kernel, this should be only done in usermode
While I can understand both arguments for and against a gradual rollout, this is the main issue: why do these things need to be processed in kernel? And if there’s a good reason to do it, why isn’t there some kind of circuit breaker?
because the thing that uses them is in kernel mode, and the sensor needs to be performant. at some point, the content must be consumed by the kernel mode sensor. user mode edr's exist but bypassing them is trivial, intercepting syscalls rootkit style and monitoring kernel+usermode memory is the best and most performant way to monitor the whole system.
Apple documentation argues the opposite:

"Developers can use frameworks such as DriverKit and NetworkExtension to write USB and human interface drivers, endpoint security tools (like data loss prevention or other endpoint agents), and VPN and network tools, all without needing to write kexts. Third-party security agents should be used only if they take advantage of these APIs or have a robust road map to transition to them and away from kernel extensions."

Specifically the 2nd sentence above says security software should use the APIs, not Apple's kernel extensions.

well, this is windows not macos. I don't know what you can do with driverkit for example. maybe microsoft should learn from apple?
they probably didn't find a solution where they could fully trust information coming from a usermode process
they need to be processed in kernel mode where the monitoring happens, user mode EDRs are trivial to bypass. they have to be processed by whatever is going to use them, and in this case it is the "lightweight" sensor code in kernel mode.
They need to load data into the kernel eventually but that doesn’t mean that the first time the file is parsed should be in the kernel. For example, on Linux they don’t have this problem because they use the eBPF subsystem and so what’s running in the kernel is validated byte code. Even if they didn’t want to do something that sophisticated they could simply include a validator into the update process, as has been common since the 1980s.
My understanding is they basically deployed a configuration file. It seems like these files might be akin to virus signatures or other frequently updated run-time configuration.

I actually don't think it's outrageous that these files are rolled out globally, simultaneously. I'm guessing they're updated frequently and _should_ be largely benign.

What stands out to me is the fact that a bad config file can crash the system. No rollback mechanism. No safety checks. No safe failure mode. Just BSOD.

Given the fix is simply deleting the broken file, it's astounding to me that the system's behavior is BSOD. To me, that's more damning than a bad "software update". These files seem to change often and frequently. Given they're critical path, they shouldn't have the ability to completely crash the system.

> I actually don't think it's outrageous that these files are rolled out globally, simultaneously.

Anyone competent that manages software at scale should generally hold the opposite opinion to this.

That’s the danger of running in kernel mode. I’ve seen some people claim this is because the bad file starts a chain of events which concludes in trying to page an unpageable file, which is an application crash in user space but brings down the whole system if it happens in the kernel.
That seems like programming 101 for these systems.

In the past, I've worked around this by validating the configuration of a file before attempting to run it. You bail out in a safe way during validation, but still allow a hard error during run time.

Doesn't prevent all misconfigured files, but prevents stuff like this.
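
A minimal sketch of that validate-then-load split, in Python. The file format here (a 4-byte magic, a record count, then fixed-size records holding an offset into a 64-entry table) is made up for illustration and has nothing to do with the real channel file layout.

```python
import struct

MAGIC = b"CHNL"                  # hypothetical magic bytes
HEADER = struct.Struct("<4sI")   # magic, record count
RECORD = struct.Struct("<HH")    # rule id, offset into a lookup table
TABLE_SIZE = 64                  # hypothetical size of that table


class InvalidConfig(ValueError):
    """Raised when a config blob fails validation."""


def validate(blob: bytes) -> int:
    """Return the record count, or raise InvalidConfig.

    Every length and offset is checked before anything would be indexed.
    """
    if len(blob) < HEADER.size:
        raise InvalidConfig("truncated header")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise InvalidConfig("bad magic")
    if len(blob) != HEADER.size + count * RECORD.size:
        raise InvalidConfig("record count does not match file size")
    for i in range(count):
        _rule_id, offset = RECORD.unpack_from(blob, HEADER.size + i * RECORD.size)
        if offset >= TABLE_SIZE:
            raise InvalidConfig(f"record {i}: offset {offset} out of range")
    return count


def load_or_keep(new_blob: bytes, current_blob: bytes) -> bytes:
    """Swap in the new config only if it validates; otherwise keep the current one."""
    try:
        validate(new_blob)
    except InvalidConfig:
        return current_blob
    return new_blob
```

A file full of zeros fails the magic check and the previous config stays active. Anything that gets past validation can still hard-fail at run time, which is the "doesn't prevent all misconfigured files" caveat.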

I think it was in the early 90s when I first saw something do A/B style loading where it would record the attempt to load something, recognize that it hadn’t finished, and use the last known good config instead. Anyone studying high-availability systems has a wealth of prior art to learn from.
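
Roughly what that record-the-attempt trick looks like, as a Python sketch. The marker file names and the `load` callback are hypothetical, and a real implementation would have to fsync the marker before doing the risky step; this only shows the control flow.

```python
from pathlib import Path
from typing import Callable, Optional


def choose_config(state_dir: Path, candidate: str,
                  load: Callable[[str], None]) -> Optional[str]:
    """Try to load `candidate`; fall back to the last known good config.

    Two small files in state_dir (names are made up for this sketch):
      attempt - the config we were in the middle of loading
      good    - the last config that finished loading
    Returns the name of the config that ends up active (None if there is none).
    """
    marker = state_dir / "attempt"
    good = state_dir / "good"
    last_good = good.read_text() if good.exists() else None

    # A leftover marker for this candidate means an earlier attempt never
    # finished (crash, power cut), so do not try the same file again.
    if marker.exists() and marker.read_text() == candidate:
        return last_good

    marker.write_text(candidate)  # record the attempt BEFORE the risky step
    load(candidate)               # if this takes the machine down, the marker survives
    good.write_text(candidate)    # only reached when the load finished
    marker.unlink()
    return candidate
```

On the boot after a crash, the same candidate is skipped and the machine comes up on the previous config rather than looping.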
I think all programmers should have the experience of using and developing on a single-address-space OS with absolutely no protections like DOS, just to encourage them to improve their skills at writing better, actually correct code. When the smallest bugs will crash your system and cause you to lose work, you tend to be a lot more careful with thinking about what your code does instead of just running it to see what happens.
Suggesting “Being more careful” never solves these issues because eventually someone somewhere will have a momentary slip up that causes this.

The real takeaway is that we need to design systems so this kind of issue is less possible. Put less code in the kernel, use tools that prevent these kinds of issues, design computers that can roll back the system if they crash.

Perfect example of where instrumentation guided fuzzing like AFL would almost certainly have found a problem.
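
To make the idea concrete, here is a toy Python harness. Note this is plain random mutation fuzzing, not coverage-guided fuzzing like AFL, and `toy_parse` is a deliberately buggy stand-in written for this example, not anything from the real sensor.

```python
import random

TABLE = list(range(16))


def toy_parse(blob: bytes) -> list:
    """Deliberately buggy toy parser: trusts every byte as a table index."""
    return [TABLE[b] for b in blob]  # IndexError for any byte >= 16


def fuzz(parse, seed_input: bytes, iterations: int = 1000, seed: int = 0,
         expected=(ValueError,)) -> list:
    """Mutate seed_input at random and collect inputs that crash `parse`.

    Exceptions listed in `expected` count as clean rejections; anything
    else is recorded as a crash.
    """
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        data = bytearray(seed_input)
        for _ in range(rng.randint(1, 4)):
            choice = rng.random()
            if choice < 0.5 and data:
                data[rng.randrange(len(data))] = rng.randrange(256)  # overwrite a byte
            elif choice < 0.75 and data:
                del data[rng.randrange(len(data))]                   # drop a byte
            else:
                data.insert(rng.randint(0, len(data)), rng.randrange(256))  # add a byte
        try:
            parse(bytes(data))
        except expected:
            pass  # the parser rejected the input cleanly
        except Exception as exc:
            crashes.append((bytes(data), repr(exc)))
    return crashes
```

Running `fuzz(toy_parse, bytes([1, 2, 3, 4]))` turns up out-of-range inputs almost immediately. A parser that range-checks each byte and raises `ValueError` comes back with an empty crash list.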

I agree with the amateur hour observation. But then most things seem to be.

Entertainingly enough I got to see a similar thing happen, where a configuration file was killing hardware in the field. After the failure and remediation multiple CI jobs were put in place (some months later) to do basic validity checks on the files.

The lessons learned were "multiple parser implementations for the same thing seem bad" and "sanity checks to prevent breaking things are hard heuristics to define", so further changes were deferred.

All that to say that I can appreciate circumstances in which satisfying "don't crash the system" in response to configuration data can actually be fairly hard to realize. It can very significantly depend on the design of the pieces in question. But I also agree that it's pretty damning.

I'm more surprised at the fact that they didn't appear to have tested it on themselves first.

FWIW, at least Microsoft still "dogfoods" (and it's what coined that term), and even if the results of that aren't great, I'm sure they would've caught something of this severity... but then again, maybe not[1].

[1] https://news.ycombinator.com/item?id=18189139

This is what really would concern me too. With an issue this widespread, any reasonable testing should have detected it. Having a few dozen machines with different configurations for a few hours should have detected this. This should have been in a smoke test.

Push update to machines, observe, power cycle them, observe...

I could understand an error in some rarer setup, but this was so common that it should have been an obvious error.

Truly, how the damage became so widespread is my main question at this point.

Everyone has a buggy release at some point, but impacting global customers at this level is damn near unforgivable.

Heads need to roll for this oversight.

Because if release immediately, velocity go up
I have a friend who is a security guard at a bank in Hollywood, CA, who told me the computers at his location started going down between 12:00 and 13:00PDT (19:00-20:00UTC).

I don't understand CrowdStrike's rollout system, but given that people started seeing trouble earlier in the day, surely by that time they could have shut down the servers that were serving the updates, or something??

He also told me that soon after that the street outside the bank (another bank across the street, a hospital several blocks down) was lined with police who started barring entry to the buildings unless people had bank cards. By the time I woke up this morning technical people already knew basically what was going on, but I really underestimated how freaked out the average person must have been today.