Hacker News
by cbanek 632 days ago
I've been on a lot of oncall lists... 4-5 per week seems extremely high to me. Have you gathered up and classified what the issues were? Are there any patterns or areas of the code that seem to be problematic? Are you actually fixing and getting to the root cause of issues or are they getting worse? It sounds like you don't know the answer because you don't really understand the problem.
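
To make the "gather up and classify" step concrete, here is a minimal sketch. The incident list, the component names, and the `summarize` helper are all made up for illustration; in practice the data would come from whatever pager or ticket export you have.

```python
from collections import Counter

# Hypothetical incident log: (component, root_cause_fixed) pairs.
# Component names and values are invented for this example.
incidents = [
    ("payments-api", False),
    ("payments-api", False),
    ("deploy-pipeline", True),
    ("payments-api", True),
    ("cache", False),
]

def summarize(incidents):
    """Count pages per component and how many never got a root-cause fix."""
    pages = Counter(component for component, _ in incidents)
    unfixed = Counter(component for component, fixed in incidents if not fixed)
    # Most-paged components first; a Counter returns 0 for missing keys.
    return [(c, n, unfixed[c]) for c, n in pages.most_common()]

summary = summarize(incidents)
for component, total, open_count in summary:
    print(f"{component}: {total} pages, {open_count} without a root-cause fix")
```

Even a rough tally like this shows whether the pages cluster in one area and whether the same unfixed problem keeps coming back.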

If you don't have enough time to run the system and also do new feature work, one has to give way to the other, or you have to hire additional people (but this rarely solves the problem; if anything, it tends to make it worse for a while until the new person gets their bearings).

One way that is very simple but not easy is to let the on call engineer not do feature work and only work on on-call issues and investigating/fixing on call issues for the period of time they are on-call, and if there isn't anything on fire, let them improve the system. This helps with things like comp-time ("worked all night on the issue, now I have to show up all day tomorrow too???") and letting people actually fix issues rather than just restart services. It also gives agency to the on-call person to help fix the problems, rather than just deal with them.

2 comments

On-call engineers fixing on-call bugs is one of the simplest and most straightforward ways out of the hole.

You then also have a direct cost of being “on call” accounted for and on the sprint board.

"On call" shouldn't be an additional shift that keeps the employee at their desk. It's an emergency service with a defined SLA: acknowledge the pager within X time, review the issue and triage or escalate within Y time, and work on the issue until service is restored or the bug is rolled back (but not necessarily to the point of completing a long-term fix).
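
As a rough sketch of what a "defined SLA" could look like when written down, assuming X and Y are plain time limits measured from when the page fires. The class names, field names, and the 5 and 30 minute values are placeholders, not anything from a real paging tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class OnCallSla:
    ack_within: timedelta     # "X" in the comment above
    triage_within: timedelta  # "Y" in the comment above

@dataclass
class Page:
    fired_at: datetime
    acked_at: Optional[datetime] = None
    triaged_at: Optional[datetime] = None  # triaged or escalated

def sla_breaches(page, sla):
    """Return the SLA steps missed for one page.

    A missing timestamp is treated as a breach (assumption for this sketch).
    """
    breaches = []
    if page.acked_at is None or page.acked_at - page.fired_at > sla.ack_within:
        breaches.append("ack")
    if page.triaged_at is None or page.triaged_at - page.fired_at > sla.triage_within:
        breaches.append("triage")
    return breaches

# Placeholder limits: acknowledge in 5 minutes, triage or escalate in 30.
sla = OnCallSla(ack_within=timedelta(minutes=5), triage_within=timedelta(minutes=30))
fired = datetime(2024, 1, 1, 3, 0)
page = Page(fired, acked_at=fired + timedelta(minutes=4),
            triaged_at=fired + timedelta(minutes=45))
breaches = sla_breaches(page, sla)
```

Note that the SLA only covers acknowledging and triaging; it says nothing about sitting at a desk or finishing the long-term fix.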

This depends. There are several on-call paradigms.

In 2 of the 3 companies I've worked at that have on-call, the on-call rotation has been "the totality of your duties is being on call for [X] duration". There are no features to push; there is Op X and tickets of varying priority levels.

I've always seen it as a 'mode of operation' for a time period. Same schedule/timing unless something bad happens. Then you're the one to be woken up/disturbed. Outside of that... you're generally free to do whatever maintenance, process, or feature work you like.

This is helpful when the incidents are less 'something to revert'... and more something to do or completely remove. For example, if CI/CD relies on things on the internet, deploying caches removes a laundry list of potential snags.

On call is a bit bipolar as a result. Either comfortably wandering around looking for something worth working on, or knowing what it is - dashing to put out flames! It's not sustainable so we all take turns.

I believe a poster above was correct with their intuition. I feel there's a broken/missing feedback loop. Regular incidents happen, but they shouldn't be constant. The goal should be to eradicate them, accepting a downward trend.
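
One simple way to check for a downward trend, assuming you have incident counts per week: fit a least-squares line and look at the sign of the slope. The `trend_slope` helper and the weekly numbers are made up for illustration.

```python
def trend_slope(weekly_counts):
    """Least-squares slope of incidents per week; negative means trending down."""
    n = len(weekly_counts)
    if n < 2:
        raise ValueError("need at least two weeks of data")
    mean_x = (n - 1) / 2  # weeks are indexed 0..n-1
    mean_y = sum(weekly_counts) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(weekly_counts))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Invented numbers: incidents per week over six weeks.
weekly_counts = [5, 4, 5, 3, 2, 2]
slope = trend_slope(weekly_counts)
print("trending down" if slope < 0 else "not improving")
```

If the slope stays flat or positive over a couple of months, the feedback loop is probably the thing that's broken.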

> One way that is very simple but not easy is to let the on call engineer not do feature work and only work on on-call issues

I can vouch for this. Beyond just fixing bugs, they also are first to triage larger issues which led to higher quality bug reports. A lot of "investigate bug" tasks disappeared.