by thundershart 705 days ago

Surely, CrowdStrike's safety posture for update rollouts is in serious need of improvement. No argument there.

But is there any responsibility for the clients consuming the data to have verified these updates prior to taking them in production? I haven't worn the sysadmin hat in a while now, but back when I was responsible for the upkeep of many thousands of machines, we'd never have blindly consumed updates without at least a basic smoke test in a production-adjacent UAT type environment. Core OS updates, firmware updates, third party software, whatever -- all of it would get at least some cursory smoke testing before allowing it to hit production.

On the other hand, given EDR's real-world purpose and the speed at which novel attacks propagate, there's probably a compelling argument for always taking the latest definition/signature updates as soon as they're available, even in your production environments.

I'm certainly not saying that CrowdStrike did nothing wrong here, that's clearly not the case. But if conventional wisdom says that you should kick the tires on the latest batch of OS updates from Microsoft in a test environment, maybe that same rationale should apply to EDR agents?

3 comments

kiitos 705 days ago

> But is there any responsibility for the clients consuming the data to have verified these updates prior to taking them in production

In the boolean sense, yes. United Airlines (for example) is ultimately responsible for their own production uptime, so any change they apply without validation is a risk vector.

In pragmatic terms, it's a bit fuzzier. Does CrowdStrike provide any practical way for customers to validate, canary-deploy, etc. changes before applying them to production? And not just changes with type=important, but all changes? From what I understand, the answer to that question is no, at least for the type=channel-update change that triggered this outage. In which case I think the blame ultimately falls almost entirely on CrowdStrike.

Brian_K_White 704 days ago

"In which case I think the blame ultimately falls almost entirely on CrowdStrike"

I would say on the client for buying into CrowdStrike.

And also the client for having no contingencies and just accepting a vendor pinky-swear as meaningful.

CrowdStrike failed at their responsibilities too, I just mean that so did everyone else.

When you cede your own responsibilities to someone else and don't have that backed up with contractually enforced liability to make you whole when they fuck up, and also don't provide your own contingency so it doesn't really matter what some vendor does, that's on you. That's 100% entirely on you and it doesn't matter if a million other people also did the same utterly thoughtless and lazy thing.

kiitos 704 days ago

> I would say on the client for buying into CrowdStrike.

I understand this perspective but I think it misses the forest for the trees. You have to evaluate this kind of stuff in context. Purity tests really smack on tech message boards where nobody has any accountability to any kind of business requirements, but basically no real-world organization operates in that way, so it's all a bit irrelevant.

> When you cede your own responsibilities to someone else ...

This framing is a bit naive, I think. It isn't a boolean. Everything is about risk management, cost/benefit analysis.

thundershart 704 days ago

> From what I understand, the answer to that question is no, at least for the type=channel-update change that triggered this outage. In which case I think the blame ultimately falls almost entirely on CrowdStrike.

Honestly, it hadn't even occurred to me that software like this, marketed at enterprise customers, wouldn't have this kind of control already available. It seems like such an obvious thing for any big organization to insist on that I just took it for granted that it existed.

Whoops.

janstice 704 days ago

It seems nuts to me too - MS Defender has this out of the box. From looking at sysadmins on reddit, it seems that CS has a tiered update mechanism, but didn’t use it for this change.

cozzyd 705 days ago

Arguably United Airlines shouldn't have chosen a product they can't test updates of, though maybe there are no good options.

chrisoconnell 704 days ago

> Arguably United Airlines shouldn't have chosen a product they can't test updates of, though maybe there are no good options.

I used to work with regional parks and recreation departments, and they would not approve any updates that did not go through the UAT environments that we had set up. All updates had to be deployed to their UAT and thoroughly tested before going to their production environment.

I get that this is slightly different, but I'd imagine airlines, banks, and hospitals would have far stricter UAT policies to keep a single vendor from kneecapping operations.

vel0city 704 days ago

> Does CrowdStrike provide any practical way for customers to validate, canary-deploy, etc. changes before applying them to production?

They do, but this update bypassed all of those rules.

jamesfinlayson 704 days ago

Checks out - my company had lots of issues on Friday afternoon, and when it first happened I wondered who on Earth decided to roll out updates to prod systems on Friday afternoon.

No one at my company apparently.

suzzer99 704 days ago

Yeah, one of the major problems seems to be CrowdStrike's assumption that channel files are benign, which isn't true if there's a bug in your code that only gets triggered by the right virus definition.

I don't know how you could assert that this is impossible, hence channel files should be treated as code.
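
To make "channel files should be treated as code" concrete, here is a minimal sketch of a loader that validates a content file before using it and keeps the last known-good rules if validation fails. Everything in it is an assumption for illustration: the pipe-delimited format, the `Rule` fields, and the function names are hypothetical and say nothing about how CrowdStrike's channel files or sensor actually work.

```python
from dataclasses import dataclass

# Hypothetical format: one rule per line, "name|pattern|action".
EXPECTED_FIELDS = 3


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: str
    action: str


def parse_channel_file(text: str) -> list[Rule]:
    """Parse the hypothetical content format, rejecting malformed input."""
    rules = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("|")
        if len(fields) != EXPECTED_FIELDS:
            # Refuse the whole file rather than read fields that are not there.
            raise ValueError(
                f"line {lineno}: expected {EXPECTED_FIELDS} fields, got {len(fields)}"
            )
        rules.append(Rule(*(f.strip() for f in fields)))
    if not rules:
        raise ValueError("content file contains no rules")
    return rules


def load_with_fallback(new_text: str, last_known_good: list[Rule]) -> list[Rule]:
    """Use the new content only if it validates; otherwise keep the old rules."""
    try:
        return parse_channel_file(new_text)
    except ValueError:
        return last_known_good
```

The point of the sketch is only that a bad data file results in a rejected update instead of a crash in whatever consumes it; a real agent would also need to log and report the rejection.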

stoolpigeon 705 days ago

I think point 3 of the grandparent indicates admins were not given an opportunity to test this.

My company had a lot of Azure VMs impacted by this and I'm not sure who the admin was who should have tested it. Microsoft? I don't think we have anything to do with CrowdStrike software on our VMs. (I think. I'm sure I'll find out this week.)

Edit: I just learned the Azure central region failure wasn't related to the larger event, and we weren't impacted by the CrowdStrike issue. I didn't know they were two different things, so the second part of my comment is irrelevant.

thundershart 705 days ago

Oh, I'd missed point #3 somehow. If individual consumers weren't even given the opportunity to test this, whether by policy or by bug, then ... yeesh. Even worse than I'd thought.

Exactly which team owns the testing is probably left up to each individual company to determine. But ultimately, if you have a team of admins supporting the production deployment of the machines that enable your business, then someone's responsible for ensuring the availability of those machines. Given how impactful this CrowdStrike incident was, maybe these kinds of third-party auto-update postures need to be reviewed and potentially brought back into the fold of admin-reviewed updates.

volkl48 705 days ago

It's not an option. While the admins at the customer have the ability to control when/how revisions of the client software go out (and thus can, and generally do, run their own testing, can decide to stay one rev back by default, etc.), there is no control over updates to the kind of update/definition files that were the primary cause here.

Which is also why you see every single customer affected - what you are suggesting is simply not an available thing to do at present for them.

At least for now - I imagine that some kind of staggered/slowed/ringed option will have to be implemented in the future if they want to retain customers.
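
For anyone unfamiliar with the idea, a staggered/ringed option could look roughly like the sketch below: push to a small canary ring first, and only continue to the next ring if the failure rate stays within a threshold. The function name, the ring layout, and the `apply_update` callback are all hypothetical; this is not an existing CrowdStrike feature or API.

```python
from typing import Callable


def staged_rollout(
    rings: list[list[str]],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> list[str]:
    """Apply an update ring by ring, halting if a ring fails too often.

    `apply_update(host)` is a hypothetical callback that returns True when
    the host is still healthy after taking the update.
    Returns the hosts that were updated successfully.
    """
    updated = []
    for ring in rings:
        if not ring:
            continue  # nothing to do for an empty ring
        failures = 0
        for host in ring:
            if apply_update(host):
                updated.append(host)
            else:
                failures += 1
        if failures / len(ring) > max_failure_rate:
            break  # do not promote the update to later rings
    return updated


rings = [["canary-1", "canary-2"], ["prod-1", "prod-2", "prod-3"]]
print(staged_rollout(rings, lambda host: True))
```

With a broken update, only the canary ring takes the hit and the production ring is never touched. The trade-off raised earlier in the thread still applies: every ring you wait on is time that a new signature is not protecting the rest of the fleet.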
