| - Every team is free to choose the on-call practices (escalations, ramp-up for new members, etc.) that work for them, although many practices are shared. Tooling is the same across the board. I'll speak for my team.
- Rotations are mixed, consisting of weekdays and weekends. In other words, if you're on-call M-F this week, in a few weeks you'll go on-call over the weekend.
- If the on-call engineer does not pick up, the incident is escalated to the manager, then their manager, all the way up to the CTO. Some teams have secondaries before managers are involved.
- The on-call engineer normally works off the regular backlog. We have talked about pulling them out to work on on-call + tech debt exclusively, but there hasn't been enough on-call churn to justify this.
- The on-call is on the hook for resolving the immediate issue. In many cases this does not mean actually fixing the underlying problem. Underlying problems get written up as tickets to be prioritized as part of the backlog.
- The priority is determined as a team. As a manager, I encourage the team to aggressively tackle on-call issues, and I don't want to be a blocker for that. If something consistently pages us, it is guaranteed to make it to the top of the backlog fast. We have also chosen to prioritize product work over on-call tickets when we felt our pace was good and the on-call churn was not too terrible. Kanban makes priority changes pretty easy.
- Not sure what you mean by "on-call can't complete in time". On-call-related tickets end up on the backlog so anyone can tackle them, not just the person who filed them.
- We have a pretty good on-call culture and the teams are very sensitive to other teams' pain. If we're being paged about an issue with another team's code (shouldn't happen too often), there's always the option to page their on-call and triage together.
- We track operational churn, send a weekly on-call hand-off email to the team with notes about the pages and the steps taken, and hold operational review meetings where these emails (plus any other operational matters) are reviewed and next steps are determined. Maintaining transparency around operational pain and building structure around the follow-up process has been really helpful in reducing our weekly churn.
- I think the ideal number of team members in an on-call rotation is around 5 to 8. Any fewer and rotations become very frequent and a burden on the individuals. Any more and rotations are so rare that every on-call shift feels like the first time.
- Last piece of advice is to FIX THE PROBLEMS (or change your alerts). Build whatever process you need to make sure that you have plenty of leeway to address anything that pages you constantly. Don't get overly attached to the alerts: you might find that changing sensitivity thresholds, or deleting the alert altogether, is actually the right answer (please don't do that for the wrong reasons though :)). If you're paged for a bunch of things that you can't fix, you're probably doing it wrong. Just like technical debt, if you don't tend to on-call issues they WILL get out of hand. Picking away at them slow and steady will almost certainly help reduce them from a torrent to a trickle.

N.B. I work at PagerDuty. A bunch more suggestions here: https://www.pagerduty.com/blog/tag/best-practices/. |
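
For anyone who wants to see the escalation order described above written down, here is a minimal Python sketch. It is not our tooling or any PagerDuty API: the class name, the function names, and the 15-minute acknowledgement timeout are all hypothetical, and the chain is simplified to the levels named in the comment (more management levels may sit between the manager's manager and the CTO).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EscalationPolicy:
    # Ordered responders, first to last. The labels are illustrative only.
    levels: List[str]
    # Hypothetical value: minutes to wait for an acknowledgement
    # before moving to the next level. The comment gives no timeout.
    ack_timeout_minutes: int = 15

    def responder_at(self, minutes_unacknowledged: int) -> str:
        """Return who should be paged after this much unacknowledged time."""
        if minutes_unacknowledged < 0:
            raise ValueError("minutes_unacknowledged must be non-negative")
        level = minutes_unacknowledged // self.ack_timeout_minutes
        # Once the chain is exhausted, stay at the final level (the CTO).
        return self.levels[min(level, len(self.levels) - 1)]


def build_policy(has_secondary: bool) -> EscalationPolicy:
    """Build the chain: engineer, optional secondary, then management."""
    levels = ["on-call engineer"]
    if has_secondary:
        # Some teams page a secondary before any manager is involved.
        levels.append("secondary on-call")
    levels += ["manager", "manager's manager", "CTO"]
    return EscalationPolicy(levels=levels)
```

Calling `build_policy(has_secondary=True).responder_at(20)` would return the secondary on-call under these assumed timeouts, while the same call on a team without secondaries would already be at the manager.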