by jjice 1582 days ago
Whenever I see a product this large having an outage (a lot recently, especially with Facebook, GCP, and AWS), I can only think of how stressful it must be for whoever needs to fix it. I've been on the other side of the outage before, albeit with a much smaller product, but lord was that stressful. Thinking about the random engineers that are stressed out and thinking that they could be fired for this, even if the cause wasn't their fault (since management at large companies can be pretty thick) is very upsetting to me. Say what you want about a large company having a large outage, but it's normal engineers that are trying to fix it at the end of the day, and I can sympathize.
4 comments

I can also sympathize with the engineers trying to fix this, but I hope that they wouldn't be thinking that they could be fired for this. Successful teams that I have worked on, even at big companies with high-usage products, have always promoted a culture of "systems break - let's improve the system, not blame a person". Any 'mistake' by an employee is actually a sign of a problem with the system. Any resilient system should account for human errors - those always happen. I wouldn't want to work for a team or company that would consider firing somebody for causing an outage rather than addressing the root cause.
I work at Slack and have learned a lot from our Incident Response program. Brent Chapman helped put it together, and has a USENIX talk about it here: https://www.usenix.org/conference/srecon21/presentation/chap...

Response for this incident went by the book, as described in Brent's talk above. Incident Management programs like these ensure that incidents can be resolved while also minimizing stress and chaos for engineers and other responders.

PagerDuty has good Incident Responder and Incident Commander training courses, if you are interested in setting up a program similar to Slack's:

- https://response.pagerduty.com/training/courses/incident_res...

- https://response.pagerduty.com/training/incident_commander/

Fun fact: Brent Chapman is also known for creating the `majordomo` mailing list manager in the early 90s.

In a well-managed engineering organization, it shouldn't be stressful. A sense of urgency, sure, but stress typically only makes things worse.
Stressing out and panicking is the worst thing you can do in this situation.

Keep a cool head and focus on the problem at hand.

This is also the reason why engineering/operations should be separate from customer communications. The people who are fixing the issue should not also be the ones communicating with the outside world.

I often thought of the Zoom engineers on-call during the pandemic shutdown.

Talk about being on center stage...