| HN Mirror


	by openasocket 697 days ago
	I work on a piece of software that is installed on a very large number of servers we do not own. The CrowdStrike incident is exactly our nightmare scenario. We are extremely cautious about updates: we roll them out very slowly, with tons of metrics and automatic rollbacks. I’ve told my manager to bookmark articles about the CrowdStrike incident and share them with anyone who complains about how slow the update process is. The two golden rules are to let host owners control when to update whenever possible, and when that isn’t possible, to deploy very, very slowly. If a customer has a CI/CD system, you should make it possible for them to deploy your updates through the same mechanism, so your change gets all the same deployment safety guardrails, automated tests, and rollbacks for free. When that isn’t possible, deploy very slowly and monitor. If you start seeing disruptions in metrics (like agents suddenly not checking in because of a reboot loop), roll back or at least pause the deployment.
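	To make the "monitor and pause" rule concrete, here is a minimal sketch of a check-in gate. It is an illustration only, not the poster's actual system: the function name `rollout_action` and both thresholds are hypothetical, and a real deployment would probably look at several metrics rather than one ratio.

```python
def rollout_action(updated, checked_in, min_checkin_ratio=0.95, rollback_ratio=0.5):
    """Decide what to do with an in-progress rollout.

    updated     -- number of agents that received the update so far
    checked_in  -- how many of those agents have checked in since updating
    Both thresholds are hypothetical defaults chosen for illustration.
    Returns "continue", "pause", or "rollback".
    """
    if updated == 0:
        # Nothing has been deployed yet, so there is nothing to judge.
        return "continue"
    ratio = checked_in / updated
    if ratio < rollback_ratio:
        # Most updated agents have gone silent (e.g. a reboot loop).
        return "rollback"
    if ratio < min_checkin_ratio:
        # Something looks off; stop expanding until a human looks at it.
        return "pause"
    return "continue"


if __name__ == "__main__":
    print(rollout_action(updated=100, checked_in=99))
    print(rollout_action(updated=100, checked_in=80))
    print(rollout_action(updated=100, checked_in=10))
```

	The point of the two thresholds is that pausing is cheap and rolling back is not, so a small dip only stops the rollout from widening.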

2 comments

taspeotis 697 days ago

I don’t have much sympathy for CrowdStrike, but deploying slowly seems mutually exclusive with protecting against emerging threats. They have to strike a balance.

zavec 697 days ago

Even a staged rollout over a few hours would have made a huge difference here. "Slow" in the context of a rollout can still be pretty fast.

bboygravity 697 days ago

But it can also still be way too slow in the context of an exploit that is being abused globally.

taspeotis 697 days ago

Sure but GP is praising "deploy so slowly that people complain."

getcrunk 697 days ago

Seriously, rolling out on some exponential scale, even over the course of 10 minutes, would have stopped this dead in its tracks.
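
For anyone wondering what an exponential rollout looks like in practice, here is a rough sketch. Everything in it is an assumption for illustration: the names `exponential_stages` and `staged_rollout`, the starting size, the growth factor, and the `is_healthy` callback, which stands in for whatever health check you would really run between stages.

```python
def exponential_stages(total_hosts, first_stage=1, factor=10):
    """Yield the cumulative number of hosts covered after each stage."""
    if total_hosts < 1 or first_stage < 1 or factor < 2:
        raise ValueError("need total_hosts >= 1, first_stage >= 1, factor >= 2")
    target = first_stage
    while target < total_hosts:
        yield target
        target *= factor
    # The final stage always covers the whole fleet.
    yield total_hosts


def staged_rollout(total_hosts, is_healthy, first_stage=1, factor=10):
    """Walk through the stages, stopping at the first failed health check.

    is_healthy(deployed) is a hypothetical callback that returns True if the
    hosts updated so far look fine. Returns (hosts_updated, completed).
    """
    deployed = 0
    for target in exponential_stages(total_hosts, first_stage, factor):
        deployed = target  # stand-in for actually pushing the update
        if not is_healthy(deployed):
            return deployed, False
    return deployed, True


if __name__ == "__main__":
    print(list(exponential_stages(1_000_000, first_stage=100)))
    # A broken update is caught after the first stage only.
    print(staged_rollout(1_000_000, lambda n: False, first_stage=100))
```

With a factor of 10 and a first stage of 100 hosts, a fleet of a million is covered in five stages, so even a short wait per stage keeps the total rollout time small while limiting a bad update to the first 100 hosts.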

yardstick 697 days ago

In CrowdStrike's case, they could have rolled out to even 1 million endpoints first and done an automated sanity/wellness check before unleashing the content update on everyone.

In the past, when I have designed update mechanisms, I’ve included basic failsafes such as automatically checking the percentage of failed updates over a sliding 24-hour window and stopping any further updates if there are too many failures.
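
A minimal sketch of that kind of failsafe, under some assumptions: the class name `FailureWindow`, the 5% threshold, and the `min_samples` guard are all made up for illustration, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class FailureWindow:
    """Track update outcomes over a sliding time window."""

    def __init__(self, window_seconds=24 * 3600, max_failure_ratio=0.05, min_samples=20):
        self.window_seconds = window_seconds
        self.max_failure_ratio = max_failure_ratio
        # Avoid halting on one unlucky failure when very few results are in.
        self.min_samples = min_samples
        self._events = deque()  # (timestamp, succeeded), oldest first

    def _expire(self, now):
        # Drop results that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def record(self, timestamp, succeeded):
        """Record one update result (timestamps must not go backwards)."""
        self._events.append((timestamp, succeeded))
        self._expire(timestamp)

    def should_halt(self, now):
        """Return True if the failure ratio in the window is too high."""
        self._expire(now)
        total = len(self._events)
        if total < self.min_samples:
            return False
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / total > self.max_failure_ratio


if __name__ == "__main__":
    window = FailureWindow(window_seconds=100, max_failure_ratio=0.1, min_samples=10)
    for t in range(9):
        window.record(t, True)
    window.record(9, False)
    window.record(10, False)
    print(window.should_halt(now=10))
```

The update server would call `should_halt` before handing the update to each new batch of hosts and refuse to continue once it returns True.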

goalieca 697 days ago

They need a lab full of canaries.

raffraffraff 697 days ago

yeah, I don't get the "we couldn't have tested it" crap, because "something happens to the payload after we tested it". Create a fake downstream company and put a bunch of machines in it. That's your final test before releasing to the rest of the world.

Am4TIfIsER0ppos 697 days ago

> let [...] owners control when to update

The only acceptable update strategy for all software regardless of size or importance
