by kiitos 705 days ago
> But is there any responsibility for the clients consuming the data to have verified these updates prior to taking them in production

In the boolean sense, yes. United Airlines (for example) is ultimately responsible for their own production uptime, so any change they apply without validation is a risk vector.

In pragmatic terms, it's a bit fuzzier. Does CrowdStrike provide any practical way for customers to validate, canary-deploy, etc. changes before applying them to production? And not just changes with type=important, but all changes? From what I understand, the answer to that question is no, at least for the type=channel-update change that triggered this outage. In which case I think the blame ultimately falls almost entirely on CrowdStrike.


"In which case I think the blame ultimately falls almost entirely on CrowdStrike"

I would say on the client for buying into CrowdStrike.

And also the client for having no contingencies and just accepting a vendor pinky-swear as meaningful.

CrowdStrike failed at their responsibilities too, I just mean that so did everyone else.

When you cede your own responsibilities to someone else and don't have that backed up with contractually enforced liability to make you whole when they fuck up, and also don't provide your own contingency so it doesn't really matter what some vendor does, that's on you. That's 100% entirely on you and it doesn't matter if a million other people also did the same utterly thoughtless and lazy thing.

> I would say on the client for buying into CrowdStrike.

I understand this perspective but I think it misses the forest for the trees. You have to evaluate this kind of stuff in context. Purity tests play well on tech message boards, where nobody is accountable to any kind of business requirements, but basically no real-world organization operates that way, so it's all a bit irrelevant.

> When you cede your own responsibilities to someone else ...

This framing is a bit naive, I think. It isn't a boolean. Everything is about risk management, cost/benefit analysis.

> From what I understand, the answer to that question is no, at least for the type=channel-update change that triggered this outage. In which case I think the blame ultimately falls almost entirely on CrowdStrike.

Honestly, it hadn't even occurred to me that software like this, marketed at enterprise customers, wouldn't have this kind of control already available. It seems like such an obvious thing for any big organization to insist on that I just took it for granted that it existed.

Whoops.

It seems nuts to me too - MS Defender has this out of the box. From looking at sysadmins on reddit, it seems that CS has a tiered update mechanism, but didn’t use it for this change.
Arguably United Airlines shouldn't have chosen a product they can't test updates of, though maybe there are no good options.

> Arguably United Airlines shouldn't have chosen a product they can't test updates of, though maybe there are no good options.

I used to work with regional parks and recreation departments, and they would not approve any updates that did not go through the UAT (user acceptance testing) environments that we had set up. All updates had to be deployed to their UAT environment and thoroughly tested before going to their production environment.

I get that this is slightly different, but I'd imagine airlines, banks, and hospitals would have far stricter UAT policies to keep a single vendor from kneecapping operations.

> Does CrowdStrike provide any practical way for customers to validate, canary-deploy, etc. changes before applying them to production?

They do, but this update bypassed all of those rules.
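
For illustration only, here is a minimal sketch of a staged rollout gate that no update type can skip. The ring names and the `next_ring` function are hypothetical, not anything from CrowdStrike's product. The point is that the gate takes no "update type" argument at all, so there is no class of change that gets to bypass it.

```python
from typing import Optional

# Hypothetical ring names, ordered from smallest blast radius to largest.
RINGS = ["canary", "early", "broad"]


def next_ring(current: Optional[str], healthy: bool) -> Optional[str]:
    """Return the ring an update may advance to, or None to stop.

    There is deliberately no update-type parameter: every change,
    content or code, walks the same rings in the same order.
    """
    if current is None:
        return RINGS[0]  # every update starts in the canary ring
    if not healthy:
        return None  # any failure signal halts the rollout
    i = RINGS.index(current)
    # Advance one ring at a time; None once the last ring is done.
    return RINGS[i + 1] if i + 1 < len(RINGS) else None
```

A caller would start with `next_ring(None, True)` and only call it again once health checks from the current ring have come back.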

Checks out - my company had lots of issues on Friday afternoon, and when it first happened I wondered who on Earth decided to roll out updates to prod systems on Friday afternoon.

No one at my company apparently.

Yeah, one of the major problems seems to be CrowdStrike's assumption that channel files are benign, which isn't true if there's a bug in your code that only gets triggered by the right virus definition.

I don't know how you could assert that this is impossible, hence channel files should be treated as code.
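
To make "treat channel files as code" a bit more concrete, here is a rough sketch of validating a content file against what the parser expects before loading it, and falling back to the last known good set if anything is off. The pipe-delimited format, the field count, and the function names are all made up for the example. This is not CrowdStrike's actual file format or loader.

```python
# Hypothetical: the consuming parser indexes fields 0..2 of every rule line.
EXPECTED_FIELDS = 3


def validate_rule(line: str) -> list[str]:
    """Reject a rule line that the parser could not safely index into."""
    fields = line.split("|")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    if any(not f.strip() for f in fields):
        raise ValueError("empty field")
    return [f.strip() for f in fields]


def load_rules(new_content: str, last_known_good: list[list[str]]) -> list[list[str]]:
    """Return the parsed new rules, or the previous set if any line is invalid."""
    try:
        return [validate_rule(l) for l in new_content.splitlines() if l.strip()]
    except ValueError:
        # A bad content file must never take the consumer down with it.
        return last_known_good
```

The design choice here is that a malformed update degrades to stale definitions rather than a crash, which is the failure mode you want from something that loads on every boot.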