by Sakos 702 days ago

> The above are all things that could (and should!) be done to reduce the chances of a misbehavior happening, but we must accept that the code bug was just the specific trigger this time around and a different trigger could have had similarly nefarious consequences. The root cause behind the outage lies in the process to get the configuration change shipped to the world.

> Now, SRE 101 (or DevOps or whatever you want to call it) says that configuration changes must be staged for slow and controlled deployment, and validated at every step. Those changes should first be validated in a very small scale before being pushed to the world later on, and every push should be incremental.

Unfortunately, the article buries the lede: it isn't until about halfway through that it makes some decent points.

We should be using safer languages, but also 1) how is it possible that CrowdStrike can push a content update globally to all clients with no option for their customers to delay it for testing and 2) why doesn't CrowdStrike have internal testing before deployment?
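
For what it's worth, the staged rollout the quoted passage describes fits in a few lines. This is only a rough sketch, not anything CrowdStrike actually runs: `staged_rollout`, `apply_change`, `is_healthy`, and the 1/10/50/100 percent stages are all names and numbers I made up for illustration.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    apply_change: Callable[[str], None],   # hypothetical: pushes the update to one host
    is_healthy: Callable[[str], bool],     # hypothetical: post-update health check
    stages: Sequence[int] = (1, 10, 50, 100),  # cumulative percent of the fleet
) -> Tuple[list, bool]:
    """Push a change to a growing share of hosts, validating after each stage.

    Returns (hosts that received the change, whether the rollout completed).
    The rollout halts at the first stage whose hosts fail validation.
    """
    updated = []
    done = 0
    for percent in stages:
        # Integer math avoids float rounding; always touch at least one host.
        target = max(1, len(hosts) * percent // 100)
        batch = list(hosts[done:target])
        for host in batch:
            apply_change(host)
        updated.extend(batch)
        done = max(done, target)
        # Validate this stage before widening the blast radius.
        if not all(is_healthy(host) for host in batch):
            return updated, False
    return updated, True
```

With 200 hosts and one that breaks in the second stage, the push stops after 20 machines instead of 200. A customer-side delay option would just be one more gate in front of `apply_change`.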