- They don't do enough or the right kind of smoke tests.
- They don't do exponential-canary deployments with an ability to rollback, and instead just YOLO it.
- They don't appear to offer a customer-side change-control process that lets a security / client platform team approve and gate updates, whether for software or for definitions (or whatever they use).
This is fundamentally laziness and/or incompetency.
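For anyone unsure what an exponential canary with rollback looks like, here is a minimal sketch. It is not how any particular vendor does it. The names `canary_stages`, `rollout`, `deploy`, `healthy` and `rollback` are all made up for illustration, and the 10x growth factor is just an assumption.

```python
from typing import Callable, List, Sequence


def canary_stages(total_hosts: int, first_stage: int = 1, factor: int = 10) -> List[int]:
    """Cumulative host counts per stage, e.g. 1, 10, 100, ... capped at total_hosts."""
    stages = []
    size = first_stage
    while size < total_hosts:
        stages.append(size)
        size *= factor
    stages.append(total_hosts)
    return stages


def rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],    # hypothetical: push the update to one host
    healthy: Callable[[str], bool],   # hypothetical: post-update smoke test
    rollback: Callable[[str], None],  # hypothetical: restore the previous version
    factor: int = 10,
) -> bool:
    """Deploy in exponentially growing stages; roll everything back on failure."""
    updated: List[str] = []
    for target in canary_stages(len(hosts), factor=factor):
        # Only touch the hosts that this stage adds on top of earlier stages.
        for host in hosts[len(updated):target]:
            deploy(host)
            updated.append(host)
        # Gate the next (larger) stage on every updated host still being healthy.
        if not all(healthy(h) for h in updated):
            for h in reversed(updated):
                rollback(h)
            return False
    return True
```

With 1,000 hosts and an update that breaks every machine it lands on, this touches exactly one host before stopping and rolling back. A real pipeline would also want a soak time between stages, which is left out here.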
They say they run this process multiple times a day. Must be tens of thousands of deployments. I'd guess complacency set in at some point. Just completely inured to the risks they were taking.
Having been through enough procurement cycles as both a buyer and seller, I can say there does not need to be a whit of malfeasance for a bad decision to occur. It's aggressive sales, price wars, poorly informed decision makers, gut instinct, favoritism, familiarity, incumbency, network effects.
You notice how this outage affected hospitals and airlines? There is a strong tendency in software sales for industries to align around one or two leaders. Oh, American chose Crowdstrike? Maybe we at Delta should just do what they did. Or literally Delta hires the VP from American to be their CISO and he just does what he did before.
Vendor selection is hard and buyer's remorse is frequently hard to deal with once you've sunk cost into a migration.
Rather, my point is that there are technical and evaluation gates that companies of this nature regularly go through while contracting, and part of that is being able to talk the language of the industry properly.
This seems very amateurish for companies who regularly talk professionally to win said contracts, whether they have the best product or not.
My guess is C-suite, crisis consultants and lawyers are involved heavily so the actual engineering folks have little voice now in any communication and we get stuff like this.
Yeah, I think I'm getting more detailed analysis on social media from strangers, which I know I should take with a grain of salt. But I guess I'm expecting a lot more than "a file caused this" from the company that caused this havoc.