|
|
|
|
|
by mattbrewsbytes
642 days ago
|
|
Think of on-call like medical triage. On-call should triage outage (partial/full) level scenarios and respond to alerts, take immediate actions to remedy the situation (restart services, scale up, etc.) and then create follow-on tickets to address root causes that go into the pool of work the entire team works. Like an ER team stabilizing a patient and identifying next steps or sending the patient off to a different team to take time in solving their longer term issue. The team needs to collectively work project work _and_ opex work coming from on-call. On-call should be a rotation through the team. Runbooks should be created on how to deal with scenarios and iterated on to keep updated. Project work and opex work are related, if you have a separate team dealing with on-call from project work then there isn't a sense of ownership of the product since its like throwing things over a wall to another team to deal with cleaning up a mess. |
|