| > Big props to the on-calls during this. Kind of curious about this. I know this is probably company specific but how do outages get handled at large orgs? Would the on-calls have been called in first then called in the rest of the relevant team? Is there a leadership structure that takes command of the incident to make big coordinated decisions to manage the risk of different approaches? Would this have represented crunch time to all the relevant people or would this be a core team with other people helping as needed? |
Yes. This was a multi-day outage and eventually the oncall does need sleep, so you need more of the team to help with it. Typically, on any reasonable team, everyone who chipped in nights gets to take off equivalent days, and sprint tasks are all punted.
Yes. Not just to manage risks, but also to get quick prioritization from all teams at the company. "You need legal? Ok, meet ..." "You need string translations? Ok escalated to ..." "You need financial approval? Ok, looped in ..."
Kinda. Definitely would have represented crunch time, but a very, very demoralizing crunch time. Managers also try to insulate most of their teams from it, but everyone pays attention anyway. Keep in mind these typically last only an hour or three, and at most a few days, so there is no "core team" other than the leadership structure from your question 2. Otherwise, it is very much "people/teams helping as needed".