by robalfonso 1436 days ago
Your org needs to be at one end of a spectrum or the other. Either on-call is mostly quiet and non-disruptive, truly only there for huge issues that seldom happen, or you staff up a dedicated 24/7 team. If you're in between, you need to plan on getting to one end before you wear out your team.

I think on-call and the quality of life component are highly dependent on the company culture, the types of alerts, etc.

My org on-call was laid out like this:

3 days at a time, then a break of X days (depending on team size; the team chose this option)

Comp time for any incidents (plus manager flexibility: if you were up late fixing something, no one expected you in early, or at all, depending on how it went)

We used a provider to handle alert escalation, rotation, phone calls, etc. If someone didn't answer, the alert rotated through to the next person and on up to management.

A regular look back at the types of calls coming in, and a rebalancing of alerting priorities, to make sure that if someone is going to get a call outside office hours, it had better be necessary. We always asked, "Could this have waited?"

A general culture of helping out: if you couldn't fix something, you could ask anyone else near a machine to handle it.

A general culture of asking, "Could we have automated a fix for this alert before getting a human involved?"

Almost all tools were available on mobile, and you would be amazed how often you could fix something from a phone. In fact, I once fixed a service issue in about 10 seconds during a movie and never missed a beat.

Trading on-call windows was typical and easy.
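
For what it's worth, the escalation behavior in the list above (no answer, move to the next person, then on up to management) can be sketched roughly like this. This is not how our provider actually implemented it; `escalate`, `page`, and the example chain are made-up names for illustration, and a real service would also handle timeouts, retries, and phone calls.

```python
from typing import Callable, Optional, Sequence


def escalate(chain: Sequence[str], page: Callable[[str], bool]) -> Optional[str]:
    """Page each person in order; return the first who acknowledges, else None.

    `chain` and `page` are hypothetical stand-ins for whatever the
    alerting provider uses internally.
    """
    for person in chain:
        if page(person):  # page() returns True if the person acknowledged
            return person
    return None  # nobody answered, even at the top of the chain


# Hypothetical chain: on-call engineer, next in rotation, then management.
chain = ["primary", "secondary", "manager"]
answered = {"secondary"}  # pretend only the secondary picks up
print(escalate(chain, lambda person: person in answered))
```

The point of keeping the chain as an ordered list is that rotating or trading a shift just means reordering it.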

If your org can't do the above and is truly wearing people out, then you need to go the other way: just staff up 24/7 and let people have their lives.