by romwell 702 days ago
This reads like a bunch of baloney to obscure the real problem.

The only relevant part you need to see:

>Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

Problematic content? Yeah, this tells us exactly nothing.

Their mitigation is "ummm we'll test more and maybe not roll the updates to everyone at once", without any direct explanation of how that would prevent this from happening again.

Conspicuously absent:

— fixing whatever produced "problematic content"

— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes

— rewriting code so that the Validator and Interpreter would use the same code path to catch such issues in test

— allowing the sysadmins to roll back updates before the OS boots

— diversifying the test environment to include actual client machine configurations running actual releases as they would be received by clients

This is a nothing sandwich, not an incident review.

4 comments

I presume the first two bullet points felt obvious enough to not bother stating: of course you fix the code that crashed. The architectural changes are the more interesting bits, and they're covered reasonably well. Your third point can help but no matter what there's still going to be parts of the interpreter that aren't exercised by the validator because it's not actually running the code. Your fourth one is a fair point: building in watchdogs of some sort to prevent a crashloop would be good. Also having a remote killswitch that can be checked before turning the sensor on would have helped in containing the damage of a crashloop. Your last one I feel like is mostly redundant with a lot of the follow-ups they did commit to.

It's far from perfect (both in terms of the lack of defenses to crashloop in the sensor and in what it said about their previous practices) but calling it a nothing sandwich is a bit hyperbolic.

>I presume the first two bullet points felt obvious enough to not bother stating: of course you fix the code that crashed.

I was not talking about the code that crashed.

I guess what I wrote was non-obvious enough that it needs an explanation:

— fixing whatever produced "problematic content":

The release doesn't talk about the subsystem that produced the "problematic content". The part that crashed was the interpreter (consumer of the content); the part that generated the "problematic content" might have worked as intended, for all we know.

— fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes:

I am not talking about fixing this particular crash.

I am talking about design choices that allow such crashes in principle.

In this instance, the interpreter seemed to have been reading memory addresses from a configuration file (or something that would be equivalent to doing that). Adding an additional check will fix this bug, but not the fundamental issue that an interpreter should not be doing that.

>The architectural changes are the more interesting bits, and they're covered reasonably well

They are not covered at all. Are we reading the same press release?

>Your third point can help but no matter what there's still going to be parts of the interpreter that aren't exercised by the validator because it's not actually running the code.

Yes, that's the problem I am pointing out: the "validator" and "interpreter" should be the same code. The "validator" can issue commands to a mock operating system instead of doing real API calls, but it should go through the input with the actual interpreter.

In other words, the interpreter should be a part of the validator.
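
To make the "same code path" idea concrete, here is a minimal sketch. Everything in it is hypothetical: the rule format, `MockBackend`, `interpret` and `validate` are made up for illustration and have nothing to do with CrowdStrike's actual content format or code. The only point is that validation is a dry run of the real interpreter against a mock backend, so the two cannot disagree about what is acceptable.

```python
from dataclasses import dataclass, field


class ContentError(Exception):
    """Raised when content cannot be interpreted safely."""


@dataclass
class MockBackend:
    """Hypothetical stand-in for the OS: records actions instead of performing them."""
    actions: list = field(default_factory=list)

    def block(self, pattern: str) -> None:
        self.actions.append(("block", pattern))


def interpret(content, backend) -> None:
    """The single code path, used both for validation and in production."""
    if not isinstance(content, list):
        raise ContentError("content must be a list of rules")
    for i, rule in enumerate(content):
        if not isinstance(rule, dict):
            raise ContentError(f"rule {i}: not a mapping")
        op = rule.get("op")
        pattern = rule.get("pattern")
        if op != "block":
            raise ContentError(f"rule {i}: unknown op {op!r}")
        if not isinstance(pattern, str) or not pattern:
            raise ContentError(f"rule {i}: missing or empty pattern")
        backend.block(pattern)


def validate(content) -> bool:
    """Validation is a dry run of the real interpreter against a mock backend."""
    try:
        interpret(content, MockBackend())
    except ContentError:
        return False
    return True
```

In production you would call `interpret(content, real_backend)`. Anything the interpreter rejects, the validator rejects too, because they are literally the same function.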

>It's far from perfect (both in terms of the lack of defenses to crashloop in the sensor and in what it said about their previous practices) but calling it a nothing sandwich is a bit hyperbolic.

Sure; that's my subjective assessment. Personally, I am very dissatisfied with their post-mortem. If you are happy with it, that's fair, but you'd need to say more if you want to make a point in addition to "the architectural changes are covered reasonably well".

Like, which specific changes those would be, for starters.

>Add additional validation checks to the Content Validator for Rapid Response Content. A new check is in process to guard against this type of problematic content from being deployed in the future.

>Enhance existing error handling in the Content Interpreter.

They did write that they intended to fix the bugs in both the validator and the interpreter. Though it's a big mystery to me, and to most of the commenters on the topic, how an interpreter that crashes on a null template would ever get into production.

>They did write that they intended to fix the bugs

I strongly disagree.

"Add additional validation" and "enhance error handling" say as much as "add band-aids and improve health" in response to a broken arm.

Which is not something you'd want to hear from a kindergarten that sends your kid back to you with shattered bones.

Note that the things I said were missing are indeed missing in the "mitigation".

In particular, additional checks and "enhanced" error handling don't address:

— the fact that it's possible for content to be "problematic" for the interpreter, but not for the validator;

— the fact that "problematic" content can still crash the entire system;

— the silence about what made the content "problematic" (spoiler: a bunch of zeros, but they didn't say it) and how that content was produced in the first place, which leaves open the possibility of it happening again;

— the fact that their clients aren't in control of their own systems, have no way to roll back a bad update, and can have their entire fleet disabled or compromised by CrowdStrike in an instant;

— the fact that the business practices and incentives that kept all their "mitigation" steps (as well as steps addressing the above) from being implemented already are still driving CrowdStrike's relationship with its employees and clients.

The last point is particularly important. This is less a software issue, and more an organizational failure.

Elsewhere on HN and Reddit, people were writing that ridiculous SLAs, such as "4 hour response to a vulnerability", make it practically impossible to release well-tested code, and that reliance on a rootkit for security is little more than CYA. That means the writing was on the wall, and this will happen again.

You can't fix bad business practices with bug fixes and improved testing. And you can't fix what you don't look into.

Hence my qualification of this "review" as a red herring.

> people were writing that ridiculous SLAs, such as "4 hour response to a vulnerability"

I didn't see people explaining why this was ridiculous.

> make it practically impossible to release well-tested code

That falsely presumes the release must be code.

CrowdStrike say of the update that caused the crash: "This Rapid Response Content is stored in a proprietary binary file that contains configuration data. It is not code or a kernel driver."

>I didn't see people explaining why this was ridiculous.

Because of how it affects priorities and incentives.

E.g.: as of 2024, CrowdStrike hadn't implemented staggered rollout of Rapid Response content. If you spend a second thinking about why that's the case, you'll realize that rapid and staggered are literally antithetical.
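
For anyone unsure what "staggered" means mechanically, here is a rough sketch under my own assumptions (the function names and the hash-based bucketing are illustrative, not anything CrowdStrike has described). Each host lands in a stable bucket from 0 to 99 per release, and the vendor raises the enabled percentage over time. Every step of that ramp is waiting time, which is exactly what a "rapid" SLA squeezes out.

```python
import hashlib


def rollout_bucket(host_id: str, release_id: str) -> int:
    """Map a host to a stable bucket in [0, 100) for a given release."""
    digest = hashlib.sha256(f"{release_id}:{host_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def should_install(host_id: str, release_id: str, percent_enabled: int) -> bool:
    """A host installs the release once its bucket falls under the enabled percentage."""
    return rollout_bucket(host_id, release_id) < percent_enabled
```

Because the bucket is deterministic, a host that received the update at 10% still has it at 50%, and a bad release stopped at 1% never reaches the other 99% of the fleet.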

>CrowdStrike say of the update that caused the crash: "This Rapid Response Content is stored in a proprietary binary file that contains configuration data. It is not code or a kernel driver."

Well, they are lying.

The data that you feed into an interpreter is code, no matter what they want to call it.

It's not your kid, so "improve health" is the industry standard response here.

True, but the question is why they can keep getting away with that.

What validates the Content Validator? A Content Validator Validator?

> fixing whatever made it possible for "problematic content" to cause "ungraceful" crashes

Better not only fix this specific bug but continuously use fuzzing to find more places where external data (including updates) can trigger a crash (or worse, RCE).
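
A toy illustration of the idea, assuming a made-up length-prefixed record format (`parse_record`, `ParseError` and `fuzz` are all hypothetical, and a real setup would use a coverage-guided fuzzer rather than plain random bytes). The harness treats a clean `ParseError` as fine and anything else as a crash worth investigating.

```python
import random


class ParseError(Exception):
    """The only exception the parser is allowed to raise on bad input."""


def parse_record(data: bytes) -> list[bytes]:
    """Toy parser: one byte of field count, then length-prefixed fields."""
    if not data:
        raise ParseError("empty input")
    count, pos, fields = data[0], 1, []
    for _ in range(count):
        if pos >= len(data):
            raise ParseError("truncated before field length")
        length = data[pos]
        pos += 1
        if pos + length > len(data):
            raise ParseError("field runs past end of input")
        fields.append(data[pos:pos + length])
        pos += length
    return fields


def fuzz(parser, iterations: int = 10_000, seed: int = 0) -> list[bytes]:
    """Feed random blobs to the parser and collect inputs that crash it."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        size = rng.randrange(0, 32)
        blob = bytes(rng.randrange(256) for _ in range(size))
        try:
            parser(blob)
        except ParseError:
            pass  # rejected cleanly, which is the desired behavior
        except Exception:
            crashes.append(blob)  # any other exception counts as a crash
    return crashes
```

Running `fuzz(parse_record)` should come back empty, while a parser that indexes into its input without checking lengths will show up in the crash list quickly.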

That is indeed necessary.

But it seems to me that putting the interpreter in a place in the OS where it is allowed to crash the whole system is a fundamental design choice that fuzzing does not address at all.

An interpreter that handles data downloaded from the internet, even. That's an exploit waiting to happen.

I guess "fight fire with fire" is a great adage, so why not fight backdoors with backdoors? What could go wrong?

Also “using memory safe languages for critical components” and “detecting failures to load and automatically using the last-known-good configuration”
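
The last-known-good part is cheap to sketch. This is only an illustration under my own assumptions: JSON stands in for whatever the real content format is, and `load_config` and `load_with_fallback` are hypothetical names. A new file that fails to load is ignored in favor of the previous good copy, and a new file that loads successfully is promoted to become the next fallback.

```python
import json
import os
import shutil


def load_config(path: str) -> dict:
    """Parse a config file, raising ValueError on malformed content."""
    with open(path, "rb") as f:
        raw = f.read()
    cfg = json.loads(raw.decode("utf-8"))  # decode and parse errors are both ValueError
    if not isinstance(cfg, dict) or "rules" not in cfg:
        raise ValueError("config must be an object with a 'rules' key")
    return cfg


def load_with_fallback(new_path: str, good_path: str) -> tuple[dict, bool]:
    """Return (config, used_new), falling back to the last-known-good file."""
    try:
        cfg = load_config(new_path)
    except (OSError, ValueError):
        # The new file is missing or unusable: keep running on the old one.
        return load_config(good_path), False
    # The new file loaded: promote it atomically so it becomes the next fallback.
    tmp_path = good_path + ".tmp"
    shutil.copyfile(new_path, tmp_path)
    os.replace(tmp_path, good_path)
    return cfg, True
```

A file full of zeros fails at the parse step here and the caller carries on with the old rules. A real sensor would also want to promote only after the new content has run for a while without crashing, which this sketch does not attempt.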