by fractalcat 3810 days ago

Last time I was in a role which involved on-call rotation:

> expected duties (only answer on-call, do other work, etc)

Only answer pages. My employer ran shifts a bit differently from most companies: shifts were only six hours long, had no fixed schedule (they were decided a week in advance), and only covered time outside work hours (pages during work hours were handled by whichever sysadmins were on duty). This worked quite well to avoid burning out sysadmins. On-call shifts were paid, and shortages of volunteers were rare.

I'd expect to spend maybe fifteen minutes per shift fixing things, on average (this was in managed hosting, so a page could concern any of our customers' services).

> how deep does your on-call dive into code to fix, vs triaging, patching, and creating follow up work to fix permanently?

In my case (sysadmin for a managed hosting company) the code involved was often not under our control; the standard practice was to escalate to the customer if the cause of the outage was a bug in the application. The usual process when we suspected a bug was to track it down if possible (the codebases were usually unfamiliar, so this wasn't always feasible), work around it as best we could (e.g., temporarily disabling a buggy search indexer which was leaking memory), and then get in touch with the customer (by email if the workaround was expected to last until work hours, by phone if not). Occasionally I'd fix the bug in place and send the customer the patch, but this was technically outside scope.
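
A minimal sketch of that email-versus-phone rule, assuming a hypothetical 09:00 start to work hours and ignoring weekends; the function names and the `WORK_START` constant are illustrative only and do not describe any actual tooling:

```python
from datetime import datetime, time, timedelta

# Hypothetical start of the working day; the real hours are not stated here.
WORK_START = time(9, 0)


def next_work_start(now: datetime) -> datetime:
    """Return the next moment work hours begin (weekends ignored for brevity)."""
    candidate = datetime.combine(now.date(), WORK_START)
    if candidate <= now:
        # Today's start has already passed, so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


def choose_contact_method(now: datetime, workaround_holds_for: timedelta) -> str:
    """Email if the workaround should last until work hours, otherwise phone."""
    if now + workaround_holds_for >= next_work_start(now):
        return "email"
    return "phone"
```

For example, a workaround applied at 02:00 that should hold for eight hours would mean an email, while one expected to hold for only two hours would mean a phone call.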

> priority order of work (on-call tickets vs regular work vs existing on-call tickets vs special queue of simple tasks)

The only priorities were resolving the pages at hand and arranging followup where needed (usually raising a ticket to be followed up during work hours).

> what happens when the current on-call can't complete in time?

Generally the on-call sysadmin would resolve whichever pages they had acknowledged; in the event of an extended outage the acking sysadmin was expected to brief and hand over to the person on the next shift.

> how do you manage for other teams' risk? (ie their api goes down, you can't satisfy your customers)

In practice, we could escalate to anyone in the company for a serious outage we were unable to handle ourselves. This was pretty rare, since we were a small ops-heavy company, but everyone had access to everyone else's cell phone number, and an outage-inducing bug was usually sufficient cause to wake someone up if it couldn't be worked around.