by openasocket 697 days ago
I work on a piece of software that is installed on a very large number of servers we do not own. The CrowdStrike incident is exactly our nightmare scenario. We are extremely cautious about updates: we roll them out very slowly, with tons of metrics and automatic rollbacks. I’ve told my manager to bookmark articles about the CrowdStrike incident and share them with anyone who complains about how slow the update process is.

The two golden rules are to let host owners control when to update whenever possible and, when that isn’t possible, to deploy very, very slowly. If a customer has a CI/CD system, you should make it possible for them to deploy your updates through the same mechanism, so your change gets all the same deployment safety guardrails, automated tests, and rollbacks for free. When that isn’t possible, deploy very slowly and monitor. If you start seeing disruptions in metrics (like agents suddenly not checking in because of a reboot loop), roll back or at least pause the deployment.
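
To make the "deploy slowly and monitor" rule concrete, here is a minimal sketch of a staged rollout that stops widening as soon as agents stop checking in. Everything in it is an assumption for illustration: `StageResult`, `next_action`, `run_rollout`, the `deploy_stage` callback, and the 99% check-in threshold are hypothetical names and values, not anything from our actual system.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    deployed: int    # hosts that received the update in this stage
    checked_in: int  # of those, hosts whose agent reported back afterwards


def next_action(result: StageResult, min_checkin_rate: float = 0.99) -> str:
    """Return 'continue' if the stage looks healthy, otherwise 'pause'."""
    if result.deployed == 0:
        return "pause"  # no data at all: do not widen the rollout
    rate = result.checked_in / result.deployed
    return "continue" if rate >= min_checkin_rate else "pause"


def run_rollout(stages, deploy_stage, min_checkin_rate: float = 0.99):
    """Deploy stage by stage and stop at the first unhealthy stage.

    `stages` is a list of (name, hosts) pairs; `deploy_stage` is a
    caller-supplied (hypothetical) function that pushes the update to the
    given hosts, waits, and returns a StageResult.
    Returns the names of the stages that completed healthily.
    """
    completed = []
    for name, hosts in stages:
        result = deploy_stage(hosts)
        if next_action(result, min_checkin_rate) == "pause":
            break  # a real system would also trigger a rollback or page someone
        completed.append(name)
    return completed
```

The point of the design is that a missing check-in counts as a failure signal, so a reboot loop shows up as a drop in the check-in rate even though the broken hosts never report an error.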

2 comments

I don’t have much sympathy for CrowdStrike, but deploying slowly seems mutually exclusive with protecting against emerging threats. They have to strike a balance.
Even a staged rollout over a few hours would have made a huge difference here. "Slow" in the context of a rollout can still be pretty fast.
But it can also still be way too slow in the context of an exploit that is being abused globally.
Sure, but GP is praising "deploy so slowly that people complain."
Seriously, rolling out on some exponential scale, even over the course of 10 minutes, would have stopped this dead in its tracks.
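
A quick sketch of what an exponential schedule could look like, to show how few waves it takes. The function name `exponential_waves`, the first wave of 10 hosts, and the growth factor of 10 are all assumptions picked for illustration, not anything CrowdStrike or anyone in this thread actually uses.

```python
def exponential_waves(total_hosts: int, first_wave: int = 10, factor: int = 10) -> list[int]:
    """Return the number of hosts in each wave.

    Each wave is `factor` times larger than the previous one, and the
    final wave takes whatever hosts remain.
    """
    if total_hosts < 0 or first_wave < 1 or factor < 2:
        raise ValueError("need total_hosts >= 0, first_wave >= 1, factor >= 2")
    waves = []
    remaining = total_hosts
    size = first_wave
    while remaining > 0:
        batch = min(size, remaining)
        waves.append(batch)
        remaining -= batch
        size *= factor
    return waves


print(exponential_waves(1_000_000))
# [10, 100, 1000, 10000, 100000, 888890]
```

Under these assumptions a million hosts are covered in six waves, so a two-minute health check between waves fits in roughly ten minutes, and a bad update that is caught after the first wave has touched 10 machines instead of all of them.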
In CrowdStrike's case, they could have rolled out to even 1 million endpoints first and done an automated sanity/wellness check before unleashing the content update on everyone.

In the past, when I have designed update mechanisms, I’ve included basic failsafes, such as automatically checking the percentage of failed updates over a sliding 24-hour window and stopping any further updates if there are too many failures.
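
A failsafe of that kind can be sketched roughly as follows. This is not the mechanism I actually shipped: the class name `FailureWindow`, the 5% threshold, and the `min_samples` guard are assumptions chosen for illustration, and the sketch assumes results are recorded in timestamp order.

```python
from collections import deque


class FailureWindow:
    """Track update outcomes over a sliding time window (hypothetical sketch)."""

    def __init__(self, window_seconds: float = 24 * 3600,
                 max_failure_rate: float = 0.05, min_samples: int = 20):
        self.window_seconds = window_seconds
        self.max_failure_rate = max_failure_rate
        self.min_samples = min_samples
        self._events = deque()  # (timestamp, succeeded), oldest first

    def record(self, timestamp: float, succeeded: bool) -> None:
        """Record one update attempt; call in increasing timestamp order."""
        self._events.append((timestamp, succeeded))

    def should_halt(self, now: float) -> bool:
        """Return True if the failure rate inside the window is too high."""
        cutoff = now - self.window_seconds
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        total = len(self._events)
        if total < self.min_samples:
            return False  # too little data to judge (an assumed policy)
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / total > self.max_failure_rate
```

The updater would call `should_halt` before sending each new batch. The `min_samples` guard avoids halting on a single early failure, but it is a judgment call: for a very small first wave you may prefer to halt on any failure at all.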

They need a lab full of canaries.
yeah, I don't get the "we couldn't have tested it" crap, because "something happens to the payload after we tested it". Create a fake downstream company and put a bunch of machines in it. That's your final test before releasing to the rest of the world.
> let [...] owners control when to update

The only acceptable update strategy for all software regardless of size or importance