by caw 3807 days ago
I've worked under two different on-call systems. The first was at an enterprise company with a very well-defined process. You got 1 hour of pay for every 10 hours on call (even if you're salaried, they work out your hourly rate for this). You had to be able to respond and/or get into the office within 1 hour, otherwise they called your manager to get someone working on the issue. Our 24/7 support triaged all issues before escalating to the on-call engineer, and they were able to resolve most issues themselves. The rotation was roughly 1 week in 3 or 4, on a volunteer basis. We also had a group of software engineers with their own on-call rotation for the internal applications that powered the business. If I needed additional support for an application, I could tell our 24/7 support folks to page/call them and they would hop online. That was needed on rare occasions, when I couldn't fix an application error or a bug slipped past the manual and automated QA.
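As a rough illustration of the 1-in-10 pay rule, here is a sketch of the arithmetic. The function name and the example rate are made up, and treating all 168 hours of a week as on-call time is an assumption rather than something I can confirm about how it was counted.

```python
# Assumed ratio from the rule above: 1 paid hour per 10 hours on call.
ON_CALL_RATIO = 10


def on_call_pay(hourly_rate: float, on_call_hours: float) -> float:
    """Extra pay for a stretch of on-call time at 1 paid hour per 10 on call."""
    # Multiply first so round numbers stay exact in floating point.
    return hourly_rate * on_call_hours / ON_CALL_RATIO


# Hypothetical example: a full 168-hour week at a derived rate of 50/hour.
week_pay = on_call_pay(50, 168)
```

With those hypothetical numbers, a week on call is worth 16.8 hours of pay.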

My current on-call system, at a startup, is considerably different. Most weeks there are no pager alerts; some weeks are particularly bad because other changes knocked something out of stability. There is a primary and a secondary on-call. Each level has 20 minutes to respond and get online before the page escalates (no office requirement since we're cloud based). The rotation goes through devops and all the software engineers, so you're on call for 2 weeks in every 8-10 with no extra pay. I wouldn't recommend this method of scheduling: it's mandatory for everyone, and some people don't take the duties seriously because nothing bad happens if you let your page slip to the next person in line, other than irking your coworker. There's no incentive to learn how to do the repairs, or to do them. I've seen a lot of "Oh, my phone was off, I didn't realize I was on-call", which never happened at my previous job with volunteers working for extra pay. Having the secondary is nice because it is a guaranteed person you can escalate to for help on a complex issue, and they are available to cover if you need to be indisposed for a short period of time.
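The primary/secondary escalation can be sketched roughly like this. It is a minimal model, not what any real paging tool does: `Page`, `current_responder`, and the chain names are hypothetical, and only the 20-minute window comes from the setup described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# From the description above: each level gets 20 minutes before escalation.
ACK_WINDOW = timedelta(minutes=20)


@dataclass
class Page:
    fired_at: datetime
    acked_at: Optional[datetime] = None  # None means nobody has responded yet


def current_responder(page: Page, chain: List[str], now: datetime) -> Optional[str]:
    """Return who holds the page at `now`, or None if the chain is exhausted."""
    # Once acknowledged, the page stays with whoever held it at ack time.
    if page.acked_at is not None and page.acked_at <= now:
        reference = page.acked_at
    else:
        reference = now
    # Dividing two timedeltas gives a float; truncating yields the level index.
    level = int((reference - page.fired_at) / ACK_WINDOW)
    return chain[level] if level < len(chain) else None


fired = datetime(2024, 1, 1, 3, 0)
chain = ["primary", "secondary"]
holder = current_responder(Page(fired), chain, fired + timedelta(minutes=25))
```

Here an unacknowledged page is with the secondary 25 minutes in, and with nobody after 40 minutes, which is exactly the gap that lets "my phone was off" go unpunished.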

About your specific questions: Your on-call duties are to communicate, fix what's broken, coordinate any additional escalations that need to happen, and most likely host the postmortem. Nothing more. Update your status page, send out an alert internally that X is broken, etc. When it's fixed and normal service has resumed, communicate that as well. You're the point of contact for anyone with questions about the issue, rather than the people you brought online for the fix, because if they're being interrupted they're not fixing.

Don't start non on-call work. Fix the problem to the extent that you know how, so that it stays stable until business hours. Not everyone has coded every system and knows the "permanent fix" for every issue. Your priority is based on what is broken and how business-critical it is. If multiple systems have failed, you fix the most critical ones first, which are normally the customer-facing ones. There should be no "existing on-call tickets", because on-call applies the band-aid and then files a high-priority issue in the normal work queue.
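If several things are down at once, the ordering can be as simple as a sort. A minimal sketch, assuming a hypothetical `Outage` record and a made-up criticality scale where 1 is the most critical; using customer-facing as the tie-breaker is my reading of "normally the customer-facing ones", not a hard rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Outage:
    system: str
    criticality: int       # hypothetical scale: 1 = most business-critical
    customer_facing: bool


def triage_order(outages: List[Outage]) -> List[Outage]:
    """Most critical first; customer-facing systems win ties."""
    # False sorts before True, so `not customer_facing` puts customer-facing first.
    return sorted(outages, key=lambda o: (o.criticality, not o.customer_facing))


# Hypothetical systems, for illustration only.
broken = [
    Outage("internal-reports", criticality=3, customer_facing=False),
    Outage("billing-batch", criticality=1, customer_facing=False),
    Outage("checkout", criticality=1, customer_facing=True),
]
ordered = [o.system for o in triage_order(broken)]
```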

If on-call can't complete the work in a reasonable amount of time, it may make sense to raise other people who can help. If it's going to take you 4 hours to get the $CRITICAL_FUNCTIONALITY back online, but getting Joe to help you will only make it take 1, then by all means try to get Joe if he's available. Again, based on your rotation he may not intend to be available, may not have his phone on, and may not want to take the call.

If you're dependent on 3rd parties, your application needs to take that into account. If you wake up because the external API is down and your functionality is down because of it, all you're going to be doing is losing sleep waiting for it to come back up so you can send out the all-clear. This changes somewhat if your external API comes with an SLA and a telephone number, in which case by all means call and start their triage chain.
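One way to encode that in alert routing, as a sketch only: page a human for a third-party outage only when there is someone to call. `Dependency`, `should_page`, and the vendor names are hypothetical, and a real setup would likely need more conditions than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dependency:
    name: str
    has_sla: bool = False
    support_phone: Optional[str] = None  # None if the vendor offers no hotline


def should_page(dep: Dependency) -> bool:
    """Wake someone only if they can act: an SLA and a number to call."""
    return dep.has_sla and dep.support_phone is not None


# Hypothetical vendors; the phone number is a placeholder.
geocoder = Dependency("free-geocoder")
payments = Dependency("payments-api", has_sla=True, support_phone="555-0100")
decisions = {d.name: should_page(d) for d in (geocoder, payments)}
```

Outages of dependencies that don't qualify can still be logged and posted to the status page automatically; they just don't need to cost anyone sleep.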

A good thing to have is a sync-up meeting with the on-call folks so you can establish patterns in the alerts that may not be evident to any one person.
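A small script can feed that meeting. This sketch assumes you can export a log of (responder, alert name) pairs from your paging tool; the function name, the log format, and the sample entries are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def shared_alerts(log: Iterable[Tuple[str, str]], min_responders: int = 2) -> List[str]:
    """Alerts handled by at least `min_responders` different people."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for responder, alert in log:
        seen[alert].add(responder)
    # Each person may have seen the alert only once, so nobody noticed a pattern.
    return sorted(a for a, who in seen.items() if len(who) >= min_responders)


# Hypothetical export covering a few rotations.
log = [
    ("alice", "disk-full:db1"),
    ("bob", "queue-backlog"),
    ("carol", "disk-full:db1"),
    ("bob", "queue-backlog"),
]
patterns = shared_alerts(log)
```

In this made-up log, `disk-full:db1` shows up because two different people each handled it once, while `queue-backlog` does not because only one person ever saw it.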