

	by jjice 1582 days ago
	Whenever I see a product this large having an outage (a lot recently, especially with Facebook, GCP, and AWS), I can only think of how stressful it must be for whoever needs to fix it. I've been on the other side of the outage before, albeit with a much smaller product, but lord was that stressful. Thinking about the random engineers that are stressed out and thinking that they could be fired for this, even if the cause wasn't their fault (since management at large companies can be pretty thick) is very upsetting to me. Say what you want about a large company having a large outage, but it's normal engineers that are trying to fix it at the end of the day, and I can sympathize.

4 comments

pqseags 1582 days ago

I can also sympathize with the engineers trying to fix this, but I hope that they wouldn't be thinking that they could be fired for this. Successful teams that I have worked on, even at big companies with high-usage products, have always promoted a culture of "systems break - let's improve the system, not blame a person". Any 'mistake' by an employee is actually a sign of a problem with the system. Any resilient system should account for human errors - those always happen. I wouldn't want to work for a team or company that would consider firing somebody for causing an outage rather than addressing the root cause.


rajbot 1581 days ago

I work at Slack and have learned a lot from our Incident Response program. Brent Chapman helped put it together, and has a USENIX talk about it here: https://www.usenix.org/conference/srecon21/presentation/chap...

Response for this incident went by the book, as described in Brent's talk above. Incident Management programs like these ensure that incidents can be resolved while also minimizing stress and chaos for engineers and other responders.

PagerDuty has good Incident Responder and Incident Commander training courses, if you are interested in setting up a program similar to Slack's:

- https://response.pagerduty.com/training/courses/incident_res...

- https://response.pagerduty.com/training/incident_commander/

Fun fact: Brent Chapman is also known for creating the `majordomo` mailing list manager in the early 90s.


lima 1582 days ago

In a well-managed engineering organization, it shouldn't be stressful. A sense of urgency, sure, but stress typically only makes things worse.


kazen44 1581 days ago

Stressing out and panicking is the worst thing you can do in this situation.

Usually, the best approach is to keep a cool head and focus on the problem at hand.

This is also the reason why engineering/operations should be separate from customer communications. The people who are fixing the issue should not also be the ones communicating with the outside world.


aantix 1581 days ago

I often thought of the Zoom engineers on-call during the pandemic shutdown.

Talk about being on center stage..
