Ask HN: Best practices for DevOps pager/on-call schedule?
9 points
by notfunk
4702 days ago
At my company, we do not have the luxury of hiring many full-time ops guys, so the engineers are part of the pager/on-call schedule. We've experimented with a daily rotating schedule (i.e. one person per day) and have been discussing other options (such as a weekly rotation). What DevOps pager/on-call schedule works best for your team? And are there any noteworthy best practices?
First, some advice on "what an emergency is" and "how we are alerted". Define what counts as an emergency at your company and tell everyone, with clear guidelines on how to get help, so that pages are limited to critical issues. Post this on an internal wiki (you have a wiki, right?). What is really worth getting woken up and coming into the office for?
Here is how our alerts are issued. Nagios alerts go to email and are generally not emergencies, although a couple of checks do fire SMS alerts. I treat the email alerts as issues for the workday and do not check them on the weekend. An automated system scans syslog for messages that might indicate an emergency, based on prior experience (e.g. db errors about a disk subsystem). Our apps also log emergency issues to syslog, and if one is triggered, an SMS goes out to the group. Any helpdesk ticket (you have a helpdesk, right?) with "emergency" in the title also issues a page; users know to do this from the "what an emergency is" wiki page.
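To make the routing concrete, here is a minimal Python sketch of the "page or email" decision. It is not our actual system: the pattern list, the function names, and the source labels ("syslog", "helpdesk") are made up for illustration, and it ignores the couple of Nagios checks that do page.

```python
import re

# Hypothetical patterns; a real list would be built from prior incidents.
EMERGENCY_PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),  # e.g. disk subsystem errors
    re.compile(r"\bEMERGENCY\b"),             # marker an app might log itself
]


def is_emergency_log_line(line: str) -> bool:
    """True if a syslog line matches any known emergency pattern."""
    return any(p.search(line) for p in EMERGENCY_PATTERNS)


def is_emergency_ticket(title: str) -> bool:
    """True if a helpdesk ticket title contains the word 'emergency'."""
    return "emergency" in title.lower()


def route_alert(source: str, text: str) -> str:
    """Return 'sms' for something worth a page, 'email' for workday items."""
    if source == "syslog" and is_emergency_log_line(text):
        return "sms"
    if source == "helpdesk" and is_emergency_ticket(text):
        return "sms"
    return "email"
```

The point is that the default is email, and something only becomes a page when it matches a rule you added on purpose.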
When a page comes in, if you can take it, you simply send an SMS "ACK" to the group. This tells everyone that you have accepted the page and are its owner. This helps us load balance across everyone's lives. If you need help, you pull in other people as needed. You also send an SMS "All Clear" when the issue is resolved, typically alongside an email to the group with an issue summary.
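If you ever want to track this in software rather than by reading the SMS thread, the ACK / All Clear convention is a tiny state machine. A rough Python sketch follows; the `Page` class and its fields are hypothetical, and the rules that the first ACK wins and that only the owner can send All Clear are my assumptions here, not something we enforce.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """One page sent to the group (hypothetical model, for illustration)."""
    summary: str
    owner: Optional[str] = None
    resolved: bool = False

    def handle_reply(self, sender: str, message: str) -> None:
        """Apply a group SMS reply to this page."""
        text = message.strip().lower()
        if text == "ack" and self.owner is None:
            # Assumption: the first ACK wins and later ACKs are ignored.
            self.owner = sender
        elif text == "all clear" and sender == self.owner:
            # Assumption: only the owner can close the page.
            self.resolved = True

    @property
    def status(self) -> str:
        if self.resolved:
            return "resolved"
        if self.owner is not None:
            return f"owned by {self.owner}"
        return "unacknowledged"
```

Anything still "unacknowledged" after a few minutes is the case to worry about, since it means nobody has taken the page.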
This entire system does not need to be complex. Start simple and iterate as needed. There also needs to be a follow-up process to find out what happened and whether we need more monitoring, additional syslog triggers, etc.
PS: Our UPS, HVAC, and security systems can issue pages via SMS too as needed. I didn't mention this because it is highly dependent on our environment. We also use a modem and landline to issue these pages. We have a Linux server with qpage [1] running on it, which issues the pages by dialing a landline at a telco. This allows us to issue pages even if our network link goes down.
PPS: Check out my website at http://sysadmincasts.com/, where I plan to cover issues like this.
[1] http://www.qpage.org/