Hacker News
Ask HN: Best practices for DevOps pager/on-call schedule?
9 points by notfunk 4702 days ago
At my company, we do not have the luxury to hire many full-time ops guys and the engineers are part of the pager/on-call schedule. We've experimented with doing a daily rotating schedule (i.e. one person per day) and have been discussing other schedule options (such as weekly rotating).

What DevOps pager/on-call schedule works best for your team? And are there any best practices that are noteworthy?

4 comments

Sorry for the brain dump, but here goes. I guess it depends on how large your company is, the size of your admin team, and how many emergencies you expect to have (is there a history you can look at? X emergencies per month, etc.?). We have 4 sysadmins, and we each have a cell phone, so we can communicate via SMS in an emergency. We are all on call, but no one is required to answer, so there is no schedule! Our emergency rate is low, one page every 3-4 months (if that). By stabilizing our environment and clearly defining what an emergency is, we have built a culture where emergencies are really emergencies, like HVAC outages, uplink outages, etc. Personally, I like to be in the loop, even if I'm not helping. This does not happen often, as I said, so there is not really a burden.

First let me give you some advice about "what an emergency is" and "how we are alerted". You need to define what an emergency is in your company and notify everyone (with clear guidelines on "how to get help"), so that you limit pages to critical issues. Post this on an internal wiki (you have a wiki, right?). What is really worth getting woken up and coming into the office for? Alerts are issued like this: Nagios alerts go to email, and these are not generally emergencies (a couple of checks do fire SMS alerts), so I consider these email alerts issues for the workday, and I do not check them on the weekend. An automated system scans syslog for alerts that might be an emergency, based on prior experience (e.g. db errors about a disk subsystem). We also have our apps log emergency issues to syslog, and if one is triggered, an SMS goes out to the group. We also have any helpdesk ticket (you have a helpdesk, right?) with "emergency" in the title issue a page; users know to do this via the "what an emergency is" wiki page.
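For anyone wanting to try the syslog-scanning idea, here is a minimal sketch in Python. The patterns and sample log lines are made up for illustration; the setup described above is not necessarily built this way, and a real pattern list would come from your own incident history.

```python
import re

# Hypothetical patterns; build the real list from past incidents.
EMERGENCY_PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),   # e.g. disk subsystem errors
    re.compile(r"\bEMERGENCY\b"),              # apps logging an explicit emergency tag
]

def find_emergencies(lines):
    """Return the log lines that match at least one emergency pattern."""
    return [
        line for line in lines
        if any(pattern.search(line) for pattern in EMERGENCY_PATTERNS)
    ]

# Invented sample lines in a syslog-like shape.
sample = [
    "Mar  3 02:14:07 db1 kernel: sd 0:0:0:0: [sda] I/O error, dev sda, sector 12345",
    "Mar  3 02:15:00 web1 cron[311]: job finished",
    "Mar  3 02:16:42 app1 billing: EMERGENCY payment queue stalled",
]
hits = find_emergencies(sample)
```

Each line in `hits` would then be handed to whatever sends the SMS to the group; everything else stays in the log for the workday.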

When a page comes in, if you can take it, you simply issue an SMS "ACK" to the group. This tells everyone that you have accepted this page and you are the owner. This helps us load balance across everyone's lives. If you need help, you pull in other people as needed. You also issue an SMS "All Clear" when the issue is resolved; this will typically go alongside an email to the group with an issue summary.
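The ACK / All Clear flow can be modeled as a tiny state tracker. This is only a sketch: the `Page` class is hypothetical, and the rules "first ACK wins" and "only the owner can send All Clear" are assumptions added here, not something stated above.

```python
class Page:
    """Tracks one page through the ACK / All Clear flow."""

    def __init__(self, summary):
        self.summary = summary
        self.owner = None       # set by the first ACK
        self.resolved = False   # set by the owner's All Clear

    def handle_sms(self, sender, text):
        """Update state from one group SMS and return (owner, resolved)."""
        msg = text.strip().lower()
        if msg == "ack" and self.owner is None:
            self.owner = sender                      # assumption: first ACK wins
        elif msg == "all clear" and sender == self.owner:
            self.resolved = True                     # assumption: owner closes it
        return self.owner, self.resolved

page = Page("uplink down")
page.handle_sms("alice", "ACK")
page.handle_sms("bob", "ACK")          # ignored, alice already owns the page
page.handle_sms("alice", "All Clear")
```

A page where `owner` is still `None` after some time is the "nobody acknowledged" case, which is where a schedule or escalation rule would have to kick in.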

This entire system does not need to be complex. Start simple and iterate as needed. There also needs to be a process to find out what happened, do we need more monitoring, additional syslog triggers, etc.

ps. our UPS, HVAC, and security systems can issue pages via SMS too as needed. I didn't mention this because it is highly dependent on our environment. We also use a modem and landline to issue these pages. We have a Linux server with qpage [1] running on it, which issues the pages by dialing a landline at a telco. This allows us to issue pages if our network link goes down too.

pps. check out my website @ http://sysadmincasts.com/ where I plan to cover issues like this.

[1] http://www.qpage.org/

Wow, thanks for the brain dump!

We're an established team/product, so we have an internal wiki, help/support desk, and use PagerDuty. We just want to shift away from only a few people (basically 2) handling DevOps emergencies and spread the experience over more members of our engineering team.

With the "all on call/no schedule" route, have you ever had a scenario where no one acknowledged an issue?

Megacorp sysadmin here - we do on-call for weekly rotations, though technically anyone can get woken up for the service they own. Weekly is easy to schedule, and it lets our boss know who the contact is for the week (since the schedule is on the wiki).

Never page if it's not an absolute dire emergency. One server out of a cluster - Next Business Day. Failed disk - NBD, unless you're out of hot spares.
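The two examples above can be written down as an explicit triage rule. This is a rough sketch: the event names, the `triage` function, and the choice to page on anything unrecognized or on a cluster with no healthy nodes left are assumptions for illustration.

```python
def triage(event, healthy_nodes=0, hot_spares=0):
    """Return "next_business_day" or "page" for a hypothetical event name."""
    if event == "node_down":
        # One server out of a cluster can wait, as long as others still serve.
        return "next_business_day" if healthy_nodes > 0 else "page"
    if event == "disk_failed":
        # A failed disk can wait unless the hot spares are used up.
        return "next_business_day" if hot_spares > 0 else "page"
    # Assumption: anything not explicitly classified goes to a human.
    return "page"

decisions = [
    triage("node_down", healthy_nodes=3),
    triage("disk_failed", hot_spares=1),
    triage("disk_failed", hot_spares=0),
]
```

Keeping the rule in one place like this also makes it easy to publish next to the "what is an emergency" definition.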

As much of your work as possible should be automated to fix it without you having to touch anything. Service down? Try restarting it. Still down? Maybe then consider an email or page.
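A minimal sketch of "restart first, page second". The three hooks (`is_healthy`, `restart`, `notify`) are hypothetical callables you would wire to your own health check, init system, and pager; nothing here is tied to a specific tool.

```python
def remediate(is_healthy, restart, notify, max_restarts=1):
    """Try restarting a service before involving a human.

    Returns "ok" if nothing was wrong, "restarted" if a restart fixed it,
    or "paged" if the service was still down and notify() was called.
    """
    if is_healthy():
        return "ok"
    for _ in range(max_restarts):
        restart()
        if is_healthy():
            return "restarted"
    notify("service still down after restart")
    return "paged"

# Fake hooks to show both outcomes.
state = {"up": False}
sent = []
fixed = remediate(lambda: state["up"], lambda: state.update(up=True), sent.append)
stuck = remediate(lambda: False, lambda: None, sent.append)
```

In this run `fixed` is "restarted" with no message sent, and `stuck` is "paged" with one message in `sent`.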

Other stuff

+Monthly or quarterly sync up meetings between all pager people. Doubly so during super critical times for the business to ensure stability.

+Single email list/PDL for the on-call (+ manager) so they can communicate about issues, as well as be cc'd on vendor support tickets (helps with hand offs)

+FAQ for your services so you don't have to wake the DBA or web admin until you know it's really hosed.

+(Sounds silly, but bears mentioning) During pager hand-off, last week's guy and this week's guy should talk about what happened and if there's anything they should know

"During pager hand-off, last week's guy and this week's guy should talk about what happened and if there's anything they should know"

Agreed, we were thinking of doing week long rotations (Tuesday - Tuesday) with a "hand off conversation" happening on Tuesdays.
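A Tuesday-to-Tuesday rotation is easy to compute from a single anchor date. Sketch only: the names and the anchor are invented, and it assumes the hand-off happens at the start of each Tuesday.

```python
from datetime import date

def on_call(day, team, anchor):
    """Return who holds the pager on `day` in a weekly rotation.

    `anchor` is a hand-off date (a Tuesday here) on which team[0] starts a shift.
    """
    weeks_since_anchor = (day - anchor).days // 7
    return team[weeks_since_anchor % len(team)]

team = ["ana", "ben", "cy"]      # hypothetical names
anchor = date(2024, 1, 2)        # a Tuesday

monday = on_call(date(2024, 1, 8), team, anchor)    # still the first shift
tuesday = on_call(date(2024, 1, 9), team, anchor)   # hand-off day, next person
```

Because the shift boundary sits on Tuesday, a Monday holiday falls inside one person's shift and the hand-off conversation still lands on a working day.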

Tuesday does solve the 3-day weekend problem. What do you do if Monday is a holiday? Trade on Monday morning and meet up outside of work, or just hold it till Tuesday? Most of the time we just hold it.

The reason for this discussion is that, up until a certain seniority level, you get "hazard pay" for carrying the pager. You get paid 1 hour for every so many you're on call. A weekend/holiday is 24 hours instead of 8 on the day you receive it or 16 on a weekday.

You should also cover rules for holding the pager. Ours include no alcohol, and no more than 1 hour away from the site (certain emergencies may require on-site visits). You also need to respond within 20 minutes, otherwise it gets escalated, or in certain larger locations, sent to the backup on-call person.
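The 20-minute rule can be expressed as a small check that a pager script runs periodically. A sketch under assumptions: the function name and the primary/backup labels are made up, and "escalate" is simplified to "notify the backup instead".

```python
from datetime import datetime, timedelta

ACK_DEADLINE = timedelta(minutes=20)   # the response window mentioned above

def who_to_notify(paged_at, now, acked, primary, backup):
    """Return who should be notified right now, or None once the page is acked."""
    if acked:
        return None
    if now - paged_at > ACK_DEADLINE:
        return backup      # deadline missed, escalate
    return primary

t0 = datetime(2024, 1, 2, 3, 0)
early = who_to_notify(t0, t0 + timedelta(minutes=5), False, "primary", "backup")
late = who_to_notify(t0, t0 + timedelta(minutes=25), False, "primary", "backup")
```

Here `early` is still the primary and `late` has moved to the backup.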

There are two pieces of advice that I give all of the startups I work with:
- No Spurious Alerts
- Don't test code in production, ever.

Keep a release schedule, stick to it, do not deviate. If you can't get your stuff tested before the deadline, that's on you, and your peers should not suffer. Make sure that all engineers receive alerts via email; it's necessary to "share the pain" so that people get an idea of what mistakes do. Weekly rotation is probably the best thing to do; that way there is a consistent point person for the week.

Agreed, we are thinking the weekly rotation will be better than daily to help whoever is on call have a sense of "ownership" of the environment.

Our overall goal is to keep the DevOps skills sharp between as many members of the team as possible...

Have you considered contracting the ops out? I am in the process of developing a managed service for ops, using a lot of automation, config management, etc. Basically I install a chef agent on the server and do everything via code I've written. Email me if you want to talk about it: anthony@makerops.com. Otherwise, are you doing 24x7 rotations? That, along with everyone's geo location, will determine the best way to set up shifts.
We have not discussed this point, but I'm assuming it's off the table due to cost. And yes, this is for 24x7 rotations where everyone is in the same timezone...
I'd love to get feedback on pricing etc, to see how far off I am, if you have a second to email me? If not, nbd.

http://blog.pagerduty.com/2011/03/on-call-best-practices-par...

This is a series of posts that have pretty sane defaults; I personally would not do a daily rotation, but rather rotations of 5 days and alternating weekends (one guy does M-F, one guy does Sat/Sun) and you switch off.
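One way to read the M-F plus alternating weekend idea for two people, as a sketch. The function name, the pair, and the anchor date are hypothetical, and it assumes the two roles swap every Monday.

```python
from datetime import date

def on_call_split(day, pair, anchor_monday):
    """Two-person rota: one covers Mon-Fri, the other Sat/Sun, swapping weekly."""
    week = (day - anchor_monday).days // 7
    weekday_person = pair[week % 2]
    weekend_person = pair[(week + 1) % 2]
    # date.weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
    return weekend_person if day.weekday() >= 5 else weekday_person

pair = ("dana", "eli")               # hypothetical names
anchor_monday = date(2024, 1, 1)     # a Monday

wed = on_call_split(date(2024, 1, 3), pair, anchor_monday)
sat = on_call_split(date(2024, 1, 6), pair, anchor_monday)
```

Note that with a plain weekly swap the weekend person rolls straight into the following M-F shift; if that is too long a stretch, the swap could be staggered instead.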

Nice link! And I agree the daily rotation is a bad idea, however we're leaning towards doing a Tuesday-to-Tuesday week long shift.