by WaxProlix
1613 days ago
Oncalls get paged first and then escalate. As they assess impact to other teams and orgs, they usually post their tickets to a shared space. Once multi-team or multi-org impact is established, leadership and relevant ops groups (networking, e.g.) get pulled into a call. A single ticket gets designated the Master Ticket for the Event, and oncalls dump diagnostic info there. The root cause is found (hopefully), and affected teams work to mitigate while the RC team rushes to fix. The largest of these calls I've seen was well into the hundreds of SW engineers, managers, network engineers, etc.
Thanks for the answer. I have only ever worked with such a small team that we are all on a call every day.
I imagine it can get a little hectic in large group calls? On the engineering side, is there a command structure? Say the root cause was found and the RC team is rushing to fix it, but another team wants to mitigate in the meantime in a slightly risky way. Would their manager make a case with leadership? Or would the proposed plan just be put out for general comment as a response to that main ticket?