On-call problems – here are mine. Do you feel the same way?
9 points by on-call_guy 1100 days ago
Hi Folks,

Being on-call has been one of the most painful parts of my job as a software engineer nowadays. I have spent a lot of stressful weeks completely demotivated by how much time I put into these issues, time that could be spent on innovation. So I have listed my top issues below, in rank order. I was wondering if others feel the same pain? I also wonder why a solution can't be built for these?

1. I do not have enough information in the alert to jump right to the resolution

2. It’s not easy to find similar alerts triggered recently so that I can go back and find how they were fixed

3. I don’t find runbooks useful most of the time, as they are not up to date

4. I don’t know if there were any recently merged changes which caused these alerts/incidents

5. A lot of the time, I don’t know whom to reach out to if the alert is from another team.

6. I have to go to multiple systems to update the statuses or notes

7. I have to summarize all the details again as part of the on-call handoff summary doc at the end of the rotation

8 comments

I've experienced all these problems. There are solutions trying to address them. E.g. https://incident.io/ (which I'm not affiliated with in any way). It's not easy though. I think they all come from the root cause of teams not investing enough into making oncall processes and solutions good, and in particular not keeping things up-to-date. As you say, runbooks are often outdated. The same happens with lists mapping component ownership to teams.

There's another problem (#8 to add to the list) I also felt pain from: how you're scheduled to work oncall. We had ad-hoc manual scheduling of who would work oncall when. A tool for solving that is https://oncallscheduler.com (which I am affiliated with). It automates the oncall scheduling, making it fair and predictable, and gives all engineers self-service control over when and how they're scheduled. I'd love some feedback on it.

I previously worked at a popular startup. I was part of a team that owned many business-critical services. Our on-call was brutal. I have experienced firsthand most of the problems that you are talking about.

But interestingly, we solved these problems back then using an internal tool. Here is how the internal tool solved them:

1. It had integrations with all internal tools like task management, alert management, monitoring systems and PagerDuty. It offered one central dashboard where it all came together.

2. Each team in PagerDuty can see details of all alerts that happened in a given shift/rotation. So anyone can go there anytime to see what alerts fired, when, and to whom.

3. Each alert had an option to be marked with various tags like noisy, non-actionable, etc. Additionally, a note and follow-up task links can be added.

The tool properly solves some of the problems you mentioned. E.g. you don't need to write a summary document. It all gets captured there and can be easily viewed in the handoff meetings. With tags, you can easily find bad alerts or alerts with outdated runbooks. It's easy to hold team members accountable if they are not following process/best practices.
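To make the tagging idea concrete, here is a rough sketch of what "tag each alert, then read the handoff straight off the tags" could look like. This is not the internal tool described above; the `Alert` fields, the tag names, and the `handoff_summary` function are all made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Alert:
    """One alert fired during a rotation. All fields are hypothetical."""
    name: str
    fired_at: str                     # timestamp kept as a plain string for simplicity
    paged: str                        # who received the page
    tags: set[str] = field(default_factory=set)
    note: str = ""
    follow_ups: list[str] = field(default_factory=list)  # links to follow-up tasks


def handoff_summary(alerts: list[Alert]) -> str:
    """Build a plain-text handoff summary from tagged alerts."""
    tag_counts = Counter(tag for alert in alerts for tag in alert.tags)
    lines = [f"{len(alerts)} alerts this rotation"]
    # Per-tag counts make noisy alerts and stale runbooks easy to spot.
    for tag, count in sorted(tag_counts.items()):
        lines.append(f"  {tag}: {count}")
    # Only alerts that carry a note or a follow-up need discussion at handoff.
    for alert in alerts:
        if alert.note or alert.follow_ups:
            links = ", ".join(alert.follow_ups) or "none"
            lines.append(
                f"  {alert.name} ({alert.fired_at}, paged {alert.paged}): "
                f"{alert.note} [follow-ups: {links}]"
            )
    return "\n".join(lines)
```

The point of the sketch is that the summary is derived from what was recorded during the shift, so nobody has to write it up again at the end of the rotation.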

Oncall is a heavy process but IMO with the right tooling, a lot of problems can be solved properly.

PS - I didn't create the tool but I used it extensively to get my team's oncall under control.

I see...how does it really work though? Like if I am on-call for this week, would it show all my alerts in one place and then allow me to take some kind of action? How does it solve the other issues like stale runbooks, etc?
Some of these need to be fixed at a higher level.

On call monitoring responsibilities for a certain time period should be separate from resolution duties.

In other words, aside from some well-defined ops issues that have clear runbooks, the role of the person monitoring should be to find out or know who to escalate to, not to resolve.

It's actually a great onboarding activity as it exposes new staff members to parts of the infra and operations that their managers/peers might have neglected to mention.

The second way to alleviate the issues is to pair a person such as yourself with a person that has a lot of institutional knowledge so that you can triage together, learn from them, and update the docs so the organisation as a whole has better resources. Eventually, the percentage of incidents where you don't have the institutional knowledge to know how to proceed will decrease to the point where it's mostly safe for you to do on-call on your own.

Then eventually you become the experienced on-call person that gets paired with the new employees that are gaining that institutional knowledge.

Yep, we do this in our team too (daytime only though) and call it "on-call buddy" for the first couple of rotations.
I’m actually building exactly this. It’s a simple on-call and incident management platform that covers many of your frustrations since I’ve had the same ones. I’d love to talk to you about it and get your feedback on my progress so far if you’re interested.
Ok...Would love to know more about what is out there. The problem with current alerting solutions like PagerDuty is that they are very extensive in terms of what they offer (scheduling, reliability, etc) but not quite tailored towards the needs of the on-call engineer who wants to easily tag something, or of management who want a view of which alerts/incidents need attention.

Even xMatter, which my wife's team uses, is something they never log in to, as it crashes frequently; they prefer to debug through logs :P

Do you have an email I could reach you at? I’d like to share what I’m thinking the solution looks like and hear more about the challenges you’ve faced.

I totally agree that the tools which exist today cater towards the buyer. That is, the people with purchasing power who typically aren’t on-call. I’m building with a focus on the on-call experience for the people who are actually on call.

I just stood up my landing page if you want to keep up to date: https://simpleoncall.com/.
Thanks for the link. It doesn't have many details. I have signed up though. Looking forward to getting more details about the solution.
I worked for a systems integration and management firm for five years. We avoided the sorts of problems you describe by having management who gave us the best tools and training for our work and in return demanded exacting documentation, which had to be kept up to date as part of our work. Logs and alerts were refined to eliminate confusion. We were tasked with implementing scripts to correct or mitigate the effects of problems.

Being on-call is a challenge, but also an opportunity to improve processes. Your management should empower the team to fix the processes.

Thank you. It's a great point, and I totally agree that good management plays a big role in making life a little easier. We did raise it with our management. But one of the limitations on their end is also too many different tools and scattered information, which do not give them full insight. For example, it's very hard to know which runbooks are stale and need updates unless you frequently review them. Curious to know how such problems were solved?
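For what it's worth, the "which runbooks are stale" part can be approximated without a dedicated tool if each runbook records when it was last reviewed. A minimal sketch, assuming a hypothetical mapping of runbook name to last-review date and an arbitrary 90-day threshold (neither comes from any real system):

```python
from datetime import date, timedelta


def stale_runbooks(
    last_reviewed: dict[str, date],
    today: date,
    max_age_days: int = 90,  # arbitrary threshold, tune per team
) -> list[str]:
    """Return runbook names whose last review is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name for name, reviewed in last_reviewed.items() if reviewed < cutoff
    )
```

A review date only says someone looked at the runbook, not that it is correct, so this flags candidates for review rather than proving anything is out of date.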
Why don't you update the run books? Why don't you modify the alerts/logs to give you more information? Why don't you create the missing run books when you run into undocumented issues?
It sounds like none of those things have an owner who is tasked with keeping them up to date and correct. All the work that needs to be done needs to have a specific, well-documented owner, otherwise diffusion of responsibility ensures that it will eventually fall through the cracks.
Management's job is to be "the owner". They are ultimately responsible for making sure that there is no diffusion of responsibility.

In our weekly meetings, recurring problems were identified and fixes implemented. No call was considered completed and closed until all relevant documents had been updated as appropriate. At the yearly review the quality of your documentation was as important as your time to respond, time to fix. That is how mission critical on-call work should be handled.

That's great. I think one of the issues in our process is that we use a wiki for on-call summary/hand-off notes. That's not ALWAYS very helpful, as it depends on what engineers add to it. The time and severity of the alerts make a difference as well. E.g. if they are triggered at night or at an unfriendly time, the first instinct of the engineer is to fix it, not to make a note or document it, unless there is an easy way to do so. We use PagerDuty and I don't think it provides an easy way to make those notes or comments. So that leaves it to the engineers, who need to do it after the fact. Some teammates do it rigorously while some don't. I think management's challenge is also that they can only push so much, as it becomes an attrition risk :(
Yea, that’s not uncommon. Personally I prefer to give each document a specific owner, but either way you do it someone has to be tasked with ensuring that the documentation is correct.
It sounds like your team lacks a culture of continuous improvement - IMO in a product team on-call's full-time job is to make the next on-call engineer's job easier through deleting irrelevant alerts, automating fixes, and generally making the system more stable.

I wrote a longer guide about this here: https://onlineornot.com/incident-management/on-call/improvin...

Yeah, I must agree it is a cultural issue to some extent. But honestly, the on-call at my current company is quite demanding. So during the on-call week, though engineers try to improve it, they always run out of time or miss a few things, which then puts a burden on future on-call.

I think there should be a nice lightweight tool that gives a clear summary and tracking mechanism to make these quicker tasks. Even just to tag the runbooks which are not updated. All those notes get lost in documentation and are never referred back to.

In previous teams, we just used a JIRA backlog to manage these tasks
Yeah, JIRA could be handy and useful, though you need to create tickets for every task and rigorously track them alongside other backlog and story items.
I work at https://incident.io/ where we build a product that aims to help with this.

I'll call out some of the points you make that we can help with:

1. Our incidents are created automatically from triggers like PagerDuty incidents/OpsGenie alerts/etc and we'll pull any information we find into the Slack channel and make it easily available (pinning to channel, setting it automatically as a channel bookmark, etc). That tends to help when you jump into the incident fresh, having everything easily available.

2. We don't do much matching against previous incidents (yet) but it's easy to search for similar incidents in our dashboard. Unlike alerts, incidents have a history of updates and curated detail about how they were resolved, so a history of similar incidents is genuinely useful to you if you're facing a similar problem.

5. We have an in-product catalog where you can store features, services, etc and who owns what. Most customers ask people 'what is affected?' and have us automatically page or say who owns the feature, which really helps speed up response. Some of our customers have 5k+ services, there's no way humans can remember who owns what at that scale.

6. This is our bread-and-butter, in that we plug into everything like status pages (we offer a native status page ourselves), Jira, GitHub, whatever to make sure incident updates are pushed everywhere. The idea is responders update the incident and we'll go share that everywhere it needs to go, instead of asking people to remember when they're busy responding.

7. Our incidents help massively with this. Provided responders are pushing updates to their incidents, an on-call handover turns into a super-quick review of the incidents in our dashboard and a review of the updates/outstanding actions.

tl;dr: a lot of what you've described can be fixed or helped massively by good tooling. Even stuff like runbooks being out-of-date is improved by tools that more frequently connect people to runbooks, as if they're more reliably useful people are more incentivised to update them.

Won't solve everything but if you're on the lookout for solutions you should definitely check out incident.io and similar tools.

You can improve all of these. Why aren’t you?

E.g. instead of complaining about outdated runbooks, you can just update them.

I agree they can be improved. But it is not a one-time activity. It should be a continuous process in order to really be efficient. Also, every on-call person needs to be diligent about it. Otherwise, it piles up in the future.

Again, I am talking about the teams who are heavily loaded with alerts and incidents during on-call. In general, all these pain points vary a lot depending on the on-call load. We also have teams who do manage all of these easily as their on-call load is quite manageable. But solving for them in one place so that everyone on the team is on the same page would be amazing.

I think the biggest hassle during on-call is that a lot of stuff has no clear ownership, so I don't know who to throw the hot potato to. We are switching to a better solution with clear ownership so hopefully it helps.
lol...yeah we have that established within the team now. But it's still a challenge to find the right contact across teams at 3am if the issue is from another team. Maybe we should build a service-level ownership list so that we can tag them (in addition to their on-call). Curious to know what level of ownership you were referring to?
Basically the same as yours: who to call when shtf. The tricky part is that managers don't do trench work so once the developer responsible does not reply then the oncall has to figure it out.
Got it. I wish there was an up-to-date service-level owner identified, which would then reduce this to just a lookup and tagging. I have seen a few engineering teams at other companies start doing that.
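The "just a lookup" version of this could be as small as the sketch below. Everything in it is hypothetical: the `Owner` fields, the service and contact names, and the `contact_for` helper are placeholders, not any real tool's schema. The fallback contact is there for the case mentioned above, where the responsible developer does not reply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Owner:
    """Who to contact for a service. All fields are hypothetical."""
    team: str
    oncall_contact: str        # first person or rotation to page
    escalation_contact: str    # fallback if the first contact does not reply


# Example ownership list; in practice this would live somewhere shared.
OWNERS = {
    "payments-api": Owner("payments", "payments-oncall", "payments-lead"),
    "search-indexer": Owner("search", "search-oncall", "search-lead"),
}


def contact_for(
    service: str,
    owners: dict[str, Owner],
    primary_responded: bool = True,
) -> Optional[str]:
    """Return who to page for a service, or None if nobody owns it."""
    owner = owners.get(service)
    if owner is None:
        return None  # no recorded owner: exactly the gap worth fixing
    if primary_responded:
        return owner.oncall_contact
    return owner.escalation_contact
```

For example, `contact_for("payments-api", OWNERS, primary_responded=False)` returns the escalation contact, and a service missing from the list returns `None`, which is a useful signal that the list needs an entry. The hard part is keeping the list current, not the lookup.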