by cbanek
632 days ago
I've been on a lot of on-call rotations, and 4-5 per week seems extremely high to me. Have you gathered up and classified what the issues were? Are there any patterns, or areas of the code that seem to be problematic? Are you actually fixing issues and getting to the root cause, or are they getting worse? It sounds like you don't know the answer because you don't really understand the problem yet.

If you don't have enough time to run the system and also do new feature work, one has to give way to the other, or you have to hire additional people. Hiring rarely solves the problem, though; if anything, it tends to make things worse for a while, until the new person finds their bearings.

One approach that is very simple but not easy: for the period they are on call, let the on-call engineer skip feature work and work only on investigating and fixing on-call issues, and if nothing is on fire, let them improve the system. This helps with things like comp time ("worked all night on the issue, now I have to show up all day tomorrow too???") and lets people actually fix issues rather than just restart services. It also gives the on-call person agency to help fix the problems, rather than just deal with them.
You then also have a direct cost of being “on call” accounted for and on the sprint board.