| HN Mirror


by WestCoastJustin 4702 days ago

Sorry for the brain dump, but here goes. I guess it depends on how large your company is, the size of your admin team, and how many emergencies you expect to have (is there a history you can look at? X emergencies per month, etc.?). We have 4 sysadmins, and we each have a cell phone, so we can communicate via SMS in an emergency. We are all on call, but no one is required to answer, so there is no schedule! Our emergency rate is low: one page every 3-4 months (if that). We keep it low by stabilizing our environment and clearly defining what an emergency is, so we have a culture where emergencies are really emergencies, like HVAC outages, uplink outages, etc. Personally, I like to be in the loop, even if I'm not helping. This does not happen often, as I said, so there is not really a burden.

First, let me give you some advice about "what an emergency is" and "how we are alerted". You need to define what an emergency is in your company and notify everyone (with clear guidelines on "how to get help"), so that you limit pages to critical issues. Post this on an internal wiki (you have a wiki, right?). What is really worth getting woken up and coming into the office for? Alerts are issued like this: Nagios alerts go to email, and these are not generally emergencies (a couple of checks do fire SMS alerts). I consider the email alerts issues for the workday, and I do not check them on the weekend. An automated system scans syslog for alerts that might be an emergency, based on prior experience (e.g. db errors about a disk subsystem). We also have our apps log emergency issues to syslog, and if one is triggered, an SMS goes out to the group. Finally, any helpdesk ticket (you have a helpdesk, right?) with "emergency" in the title issues a page; users know to do this via the "what an emergency is" wiki page.
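To make the routing concrete, here is a rough sketch in Python of the "does this deserve a page?" decision. It is only an illustration under assumptions: the regex patterns, the source names, and the `should_page` function are all hypothetical, and it simplifies by treating every Nagios alert as email-only, ignoring the couple of checks that do fire SMS.

```python
import re

# Hypothetical patterns; a real list would grow out of prior incidents.
EMERGENCY_PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),  # e.g. disk subsystem trouble
    re.compile(r"\bEMERGENCY\b"),             # marker apps could log to syslog
]


def is_emergency_log_line(line: str) -> bool:
    """True if a syslog line matches any known emergency pattern."""
    return any(p.search(line) for p in EMERGENCY_PATTERNS)


def is_emergency_ticket(title: str) -> bool:
    """True if a helpdesk ticket title contains the word 'emergency'."""
    return "emergency" in title.lower()


def should_page(source: str, text: str) -> bool:
    """Decide whether an event should send an SMS page to the group.

    source is one of the assumed names "nagios", "syslog", "helpdesk".
    Anything that does not page is left for email or the normal queue.
    """
    if source == "syslog":
        return is_emergency_log_line(text)
    if source == "helpdesk":
        return is_emergency_ticket(text)
    # Simplification: Nagios alerts are treated as workday email only.
    return False
```

The point of keeping it this small is that the pattern list and the wiki definition of an emergency stay in sync: when a post-incident review finds a new failure signature, it becomes one more pattern.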

When a page comes in, if you can take it, you simply send an SMS "ACK" to the group. This tells everyone that you have accepted the page and are the owner. This helps us load balance across everyone's lives. If you need help, you pull in other people as needed. You also send an SMS "All Clear" when the issue is resolved; this typically goes alongside an email to the group with an issue summary.
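If you wanted to track the ACK / All Clear flow in software rather than by reading the SMS thread, a minimal sketch might look like this. The `Page` class and `handle_sms` function are hypothetical names, and two rules here are assumptions of mine rather than anything we enforce: the first ACK wins ownership, and an All Clear from anyone marks the page resolved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """State of a single emergency page sent to the group."""
    summary: str
    owner: Optional[str] = None   # set by the first "ACK"
    resolved: bool = False        # set by "All Clear"


def handle_sms(page: Page, sender: str, body: str) -> Page:
    """Update the page state from one SMS reply to the group."""
    text = body.strip().lower()
    if text == "ack":
        # Assumption: the first ACK wins; later ACKs do not change the owner.
        if page.owner is None:
            page.owner = sender
    elif text == "all clear":
        # Assumption: anyone in the group may declare the issue resolved.
        page.resolved = True
    return page
```

For example, after `handle_sms(page, "alice", "ACK")` the page is owned by alice, and a later "ack" from someone else leaves that unchanged.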

This entire system does not need to be complex. Start simple and iterate as needed. There also needs to be a process afterwards to find out what happened and decide whether we need more monitoring, additional syslog triggers, etc.

ps. our UPS, HVAC, and security systems can issue pages via SMS too as needed. I didn't mention this because it is highly dependent on our environment. We also use a modem and landline to issue these pages. We have a Linux server with qpage [1] running on it, which issues the pages by dialing a landline at a telco. This allows us to issue pages if our network link goes down too.

pps. check out my website @ http://sysadmincasts.com/ where I plan to cover issues like this.

[1] http://www.qpage.org/

1 comment

notfunk 4702 days ago

Wow, thanks for the brain dump!

We're an established team/product, so we have an internal wiki, help/support desk, and use PagerDuty. We just want to shift away from only a few people (basically 2) handling DevOps emergencies and spread the experience over more members of our engineering team.

With the "all on call/no schedule" route, have you ever had a scenario where no one acknowledged an issue?
