Hacker News
by bdamm 2539 days ago
In my experience customers deeply detest the idea of waiting around for a failure case to re-occur so that you can understand it better. When your customers are losing millions of dollars in the minutes you're down, mitigation would be the thing, and analysis can wait. All that is needed is enough forensic data so that testing in earnest to reproduce the condition in the lab can begin. Then get the customers back to working order pronto. 20 minutes seems like a lifetime if in fact they were concerned that the degradation could happen again at any time. 20 minutes seems like just enough time to follow a checklist of actions on capturing environmental conditions, gather a huddle to make a decision, document the change, and execute on it. Commendable actually, if that's what happened.
2 comments

> In my experience customers deeply detest the idea of waiting around for a failure case to re-occur so that you can understand it better.

Bryan Cantrill has a great talk[0] about dealing with fires where he says something to the effect of:

> Now you will find out if you are more operations or development - developers will want to leave things be to gather data and understand, while operations will want to rollback and fix things as quickly as possible

[0] Debugging Under Fire: Keep your Head when Systems have Lost their Mind - Bryan Cantrill: https://www.youtube.com/watch?v=30jNsCVLpAE

I understand it. I've worked at AWS, and now at OCI, dealing with systems that affect hundreds to thousands of customers whose businesses are at stake.

Mitigation is your top priority: bringing the system back to a good state.

If follow-up actions are needed, take the least impactful steps to prevent another wave.

If there was a deployment, roll back.

My concern here is that a deployment had been made months ago, and many other changes that could make things worse had been introduced since. That is the case here. Taking an extra 10-20 minutes to make sure everything is fine, versus making a hot call and causing another outage, makes a big difference.

I'm just asking questions based on the documentation provided; I do not have more insights.

I am happy Stripe is being open about the issue; that way the industry learns and matures regarding software-caused outages. Cloudflare's outage documentation is really good as well.

> My concern here is that a deployment had been made months ago, and many other changes that could make things worse had been introduced since.

Make every bit of software in your stack export its build date as a monitoring metric. Have an alert fire if any bit of software goes over 1 month old. Manually or automatically rebuild and redeploy that software.

That prevents 'bit rot', where you daren't rebuild or roll back something.
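
A rough sketch of the build-age check, in case it helps. Everything here is an assumption for illustration: the service names and dates are made up, "1 month" is approximated as 30 days, and a real setup would export the age through whatever metrics system you already run rather than a plain function.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "1 month" is approximated as 30 days.
MAX_BUILD_AGE = timedelta(days=30)


def build_age_seconds(build_time, now):
    """Age of a build in seconds; the value a service could export as a gauge."""
    return (now - build_time).total_seconds()


def stale_builds(build_times, now, max_age=MAX_BUILD_AGE):
    """Return the sorted names of services whose build is older than max_age."""
    return sorted(
        name for name, built in build_times.items() if now - built > max_age
    )


# Hypothetical services and build dates, for illustration only.
now = datetime(2019, 7, 15, tzinfo=timezone.utc)
builds = {
    "api": datetime(2019, 7, 1, tzinfo=timezone.utc),
    "billing": datetime(2019, 3, 2, tzinfo=timezone.utc),
}

# Anything listed here would trigger the "rebuild and redeploy" alert.
print(stale_builds(builds, now))
```

Passing `now` in explicitly keeps the check easy to test; in production it would presumably be `datetime.now(timezone.utc)`.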

In a lot of environments this is a terrible idea. In private environments exposing build manifest information is a good idea, but not so that you can alert at 1 month. Where I work, software that's 2-3 years old is considered good - mature, tested, thoroughly operationalized, and understood by all who need to interact with it on a daily basis. Often, consistency of the user experience is better than being bug free.