by caw 4702 days ago
Megacorp sysadmin here - we do on-call in weekly rotations, though technically anyone can get woken up for the service they own. Weekly is easy to schedule, and it lets our boss know who the contact is for the week (since the schedule is on the wiki).

Never page if it's not an absolute dire emergency. One server out of a cluster - Next Business Day. Failed disk - NBD, unless you're out of hot spares.

Automate as much of the fixing as possible so problems get resolved without you having to touch anything. Service down? Try restarting it. Still down? Maybe then consider an email or page.
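
A rough sketch of that "restart first, page second" flow, in Python. Everything named here (`remediate`, the `is_healthy` / `restart` / `notify` callables, the 30 second wait) is made up for illustration and is not from any particular monitoring tool; plug in whatever your checks and pager actually use.

```python
import time
from typing import Callable


def remediate(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    notify: Callable[[str], None],
    wait_seconds: float = 30.0,  # assumed settle time, tune per service
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try an automatic restart before bothering a human.

    Returns "healthy", "restarted", or "escalated".
    """
    if is_healthy():
        return "healthy"

    # Service down? Try restarting it.
    restart()
    sleep(wait_seconds)

    if is_healthy():
        return "restarted"

    # Still down? Only now send the email or page.
    notify("service still down after automatic restart")
    return "escalated"
```

On a systemd box, for example, `is_healthy` could wrap `systemctl is-active --quiet <unit>` and `restart` could wrap `systemctl restart <unit>`; that is an assumption about your setup, not something the flow depends on.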

Other stuff

+Monthly or quarterly sync up meetings between all pager people. Doubly so during super critical times for the business to ensure stability.

+Single email list/PDL for the on-call (+ manager) so they can communicate about issues, as well as be cc'd on vendor support tickets (helps with hand offs)

+FAQ for your services so you don't have to wake the DBA or web admin until you know it's really hosed.

+(Sounds silly, but bears mentioning) During pager hand-off, last week's guy and this week's guy should talk about what happened and if there's anything they should know

1 comment

"During pager hand-off, last week's guy and this week's guy should talk about what happened and if there's anything they should know"

Agreed, we were thinking of doing week-long rotations (Tuesday - Tuesday) with a "hand off conversation" happening on Tuesdays.

Tuesday does solve the 3 day weekend problem. What do you do if Monday is a holiday? Trade on Monday morning and meet up outside of work, or just hold it till Tuesday? Most of the time we just hold it.

The reason for this discussion is that up until a certain seniority level, you get "hazard pay" for carrying the pager. You get paid 1 hour for every so many you're on call. A weekend/holiday is 24 hours instead of 8 on the day you receive it or 16 on a weekday.
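
To make the arithmetic concrete, here is a small Python sketch of one reading of that rule: 24 on-call hours for a weekend day or holiday, 8 for the day you pick up the pager, 16 for any other weekday. The ratio of on-call hours to paid hours isn't given above, so it is a required parameter here, and the names (`ON_CALL_HOURS`, `hazard_pay_hours`) are invented for the example.

```python
# On-call hours credited per day type (one reading of the rule above).
ON_CALL_HOURS = {
    "weekend_or_holiday": 24,
    "handoff_day": 8,   # the day you receive the pager
    "weekday": 16,
}


def hazard_pay_hours(day_types, on_call_hours_per_paid_hour):
    """Paid hours earned for a list of day types.

    on_call_hours_per_paid_hour is the "every so many" ratio; it is
    site-specific and deliberately has no default.
    """
    total_on_call = sum(ON_CALL_HOURS[d] for d in day_types)
    return total_on_call / on_call_hours_per_paid_hour


# A Tuesday-to-Tuesday week with no holidays: Tue (handoff),
# Wed, Thu, Fri, Sat, Sun, Mon.
week = ["handoff_day"] + ["weekday"] * 3 + ["weekend_or_holiday"] * 2 + ["weekday"]

# 8 + 4*16 + 2*24 = 120 on-call hours; with a hypothetical ratio of 8:
print(hazard_pay_hours(week, 8))  # 15.0
```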

You should also cover rules for holding the pager. Ours include no alcohol, and no more than 1 hour away from the site (certain emergencies may require on-site visits). You also need to respond within 20 minutes; otherwise the page gets escalated or, in certain larger locations, sent to the backup on-call person.
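
The 20 minute rule could be sketched in Python like this. The function name, the action strings, and the `has_backup` flag are assumptions for illustration; a real paging system would have its own way of expressing the same thing.

```python
from datetime import datetime, timedelta

RESPONSE_WINDOW = timedelta(minutes=20)  # from the rule above


def next_action(paged_at: datetime, now: datetime,
                acknowledged: bool, has_backup: bool) -> str:
    """Decide what to do with an outstanding page.

    Returns "none", "wait", "page_backup", or "escalate".
    """
    if acknowledged:
        return "none"
    if now - paged_at <= RESPONSE_WINDOW:
        # Still inside the response window, give the on-call time.
        return "wait"
    # Window missed: larger sites go to the backup on-call person,
    # everyone else escalates.
    return "page_backup" if has_backup else "escalate"
```

Running this check on a timer for every open page is enough to enforce the window without anyone watching the clock.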