|
|
|
|
|
by openasocket
697 days ago
|
|
I work on a piece of software that is installed on a very large number of servers we do not own. The crowd strike incident is exactly our nightmare scenario. We are extremely cautious about updates, we roll it out very slowly with tons of metrics and automatic rollbacks. I’ve told my manager to bookmark articles about the crowdstrike incident and share it with anyone who complains about how slow the update process is. The two golden rules are to let host owners control when to update whenever possible, and when it isn’t to deploy very very slowly. If a customer has a CI/CD system, you should make it possible for them to deploy your updates through the same mechanism. So your change gets all the same deployment safety guardrails and automated tests and rollbacks for free. When that isn’t possible, deploy very slowly and monitor. If you start seeing disruptions in metrics (like agents suddenly not checking in because of a reboot loop) rollback or at least pause the deployment. |
|