by ssalazars 2575 days ago

I understand it. I've worked at AWS, and now at OCI, dealing with systems that affect hundreds to thousands of customers whose businesses are at stake.

Mitigation is your top priority: bringing the system back to a good state.

If follow-up actions are needed, take the least impactful steps to prevent another wave.

If there was a deployment, roll back.

My concern here is that a deployment had been made months ago, and many other changes that could make things worse had been introduced since. That is the case here. Taking an extra 10-20 minutes to make sure everything is fine, versus making a hot call and causing another outage, makes a big difference.

I'm just asking questions based on the documentation provided; I do not have more insights.

I am happy Stripe is being open about the issue; that way the industry learns and matures regarding software-caused outages. Cloudflare's outage documentation is really good as well.

1 comments

londons_explore 2575 days ago

> My concern here is that a deployment had been made months ago, and many other changes that could make things worse had been introduced since.

Make every bit of software in your stack export its build date as a monitoring metric. Have an alert if any bit of software goes over 1 month old. Manually or automatically rebuild and redeploy that software.

That prevents 'bit rot', where you daren't rebuild or roll back something.

bdamm 2574 days ago

In a lot of environments this is a terrible idea. In private environments, exposing build manifest information is a good idea, but not so that you can alert at 1 month. Where I work, software that's 2-3 years old is considered good: mature, tested, thoroughly operationalized, and understood by all who need to interact with it on a daily basis. Often, consistency of the user experience is better than being bug-free.