Hacker News
by seniortaco 631 days ago
4-5 issues per week can be a lot or a little, depending on their severity. Most of them are likely recurring issues that your team sees a few times a month because the root cause hasn't been addressed, and it needs to be.

Driving down oncall load is all about working smarter, not necessarily harder. Likely 30% of the issues need to be fixed by another team. Identify those ASAP and hand them off so that the other team can work on them in parallel while your team focuses on the issues you "own".

Set up a weekly rotation for issue triage and mitigation. The engineer oncall should respond to issues, prioritize them by severity, mitigate impact, and create and track a Root Cause issue for each underlying problem. These should go into an operational backlog. This costs one full-time headcount on your team (but rotated).
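
To make the rotation and triage steps concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the engineer names, the `Issue` fields, the "1 is most severe" convention, and the `oncall_for` and `triage` helpers are all hypothetical, and a real team would more likely pull this from its paging or ticketing tool.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical roster; replace with your own team.
ENGINEERS = ["alice", "bob", "carol", "dave"]


@dataclass
class Issue:
    title: str
    severity: int    # assumed convention: 1 is the most severe
    owner_team: str  # team that has to fix the root cause


def oncall_for(day: date, engineers=ENGINEERS) -> str:
    """Return the engineer oncall for the week containing `day`.

    Weeks run Monday to Sunday and the role moves to the next
    engineer each week.
    """
    week_index = (day.toordinal() - 1) // 7
    return engineers[week_index % len(engineers)]


def triage(issues, my_team):
    """Split issues into ours (most severe first) and ones to hand off."""
    ours = sorted(
        (i for i in issues if i.owner_team == my_team),
        key=lambda i: i.severity,
    )
    handoff = [i for i in issues if i.owner_team != my_team]
    return ours, handoff
```

The `ours` list is what the oncall engineer works through and turns into Root Cause issues for the backlog; the `handoff` list is what goes to other teams right away.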

To address the operational backlog, you need to build role expectations with your entire team. It helps if leadership is involved. Everyone needs to understand that in terms of career progression and performance evaluation, operational excellence is one of several role requirements. With these expectations clearly set, review progress with your directs in recurring 1-1s to ensure they are picking up and addressing operational excellence work, driving down the backlog.