by ssalazars
2528 days ago
I understand it. I've worked at AWS, and now at OCI, dealing with systems that affect hundreds to thousands of customers whose businesses are at stake. Mitigation is your top priority: bringing the system back to a good shape. If follow-up actions are needed, take the least impactful steps to prevent another wave. If there was a deployment, roll back. My concern is the case where the deployment was made months ago, and many other changes that could make things worse have been introduced since. That is the case here. The difference between taking an extra 10-20 minutes to make sure everything is fine and taking a hot call that causes another outage is a big one. I'm just asking questions based on the documentation provided; I do not have more insight. I am happy Stripe is being open about the issue; that way the industry learns and matures regarding software-caused outages. Cloudflare's outage documentation is really good as well.
|
Make every bit of software in your stack export its build date as a monitoring metric. Have an alert fire if any bit of software goes over 1 month old. Manually or automatically rebuild and redeploy that software.
That prevents 'bit rot', where you no longer dare to rebuild or roll back something.
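
A rough sketch of that idea in Python, not tied to any particular monitoring system. The metric name `build_timestamp_seconds`, the component names, and the 30-day threshold (one reading of "1 month") are all assumptions made up for illustration; in practice the build date would be baked in at build time and the age check would live in your alerting rules.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: "1 month" is approximated here as 30 days.
MAX_BUILD_AGE = timedelta(days=30)


def build_metric_line(component: str, built_at: datetime) -> str:
    """Render a component's build date as a gauge-style text line (Unix seconds).

    The metric name is hypothetical; use whatever your monitoring stack expects.
    """
    return f'build_timestamp_seconds{{component="{component}"}} {int(built_at.timestamp())}'


def stale_components(builds: dict, now: datetime, max_age: timedelta = MAX_BUILD_AGE) -> list:
    """Return the sorted names of components whose build is older than max_age."""
    return sorted(name for name, built_at in builds.items() if now - built_at > max_age)


if __name__ == "__main__":
    # Hypothetical inventory: component name -> build date recorded at build time.
    builds = {
        "api": datetime(2024, 2, 20, tzinfo=timezone.utc),
        "db-proxy": datetime(2023, 11, 5, tzinfo=timezone.utc),
    }
    now = datetime.now(timezone.utc)
    for name, built_at in builds.items():
        print(build_metric_line(name, built_at))
    for name in stale_components(builds, now):
        print(f"ALERT: {name} build is over {MAX_BUILD_AGE.days} days old; rebuild and redeploy")
```

Exporting the raw timestamp rather than a precomputed age keeps the metric constant between deploys, so the "too old" comparison can be done wherever the alert is evaluated.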