Hacker News
by on-call_guy 1099 days ago
Thank you. It's a great point, and I totally agree that good management plays a big role in making life a little easier. We did raise it with our management. But one of the limitations on their end is also that there are too many different tools and too much scattered information, which does not give them full insight. For example, it's very hard to know which runbooks are stale and need updates unless you frequently review them. Curious to know how such problems were solved?
2 comments

Why don't you update the run books? Why don't you modify the alerts/logs to give you more information? Why don't you create the missing run books when you run into undocumented issues?
It sounds like none of those things have an owner who is tasked with keeping them up to date and correct. All the work that needs to be done needs to have a specific, well-documented owner; otherwise diffusion of responsibility ensures that it will eventually fall through the cracks.
Management's job is to be "the owner". They are ultimately responsible for making sure that there is no diffusion of responsibility.
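To make the ownership point concrete, here is a minimal sketch of a check that lists runbooks and alerts with no assigned owner. The item names, the `owners` mapping, and the `find_unowned` helper are all hypothetical; in practice the data would come from wherever your team records ownership.

```python
def find_unowned(items, owners):
    """Return the items that have no owner assigned, in sorted order."""
    # A missing key and an empty owner string both count as "no owner".
    return sorted(item for item in items if not owners.get(item))


# Hypothetical inventory of on-call artifacts and their owners.
items = ["runbook:db-failover", "alert:disk-full", "runbook:cache-flush"]
owners = {"runbook:db-failover": "alice", "alert:disk-full": ""}

unowned = find_unowned(items, owners)
print(unowned)  # ['alert:disk-full', 'runbook:cache-flush']
```

Running something like this on a schedule would at least make the gaps visible to whoever is "the owner" of the process.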

In our weekly meetings, recurring problems were identified and fixes implemented. No call was considered completed and closed until all relevant documents had been updated as appropriate. At the yearly review, the quality of your documentation was as important as your time to respond and time to fix. That is how mission-critical on-call work should be handled.

That's great. I think one of the issues in our process is that we use a wiki for on-call summaries and hand-off notes. That's not ALWAYS very helpful, as it depends on what engineers add to it. The time and severity of the alerts make a difference as well. E.g., if they are triggered at night or at an unfriendly time, the engineer's first instinct is to fix the problem, not to make a note or document it, unless there is an easy way to do so. We use PagerDuty and I don't think it provides an easy way to make those notes or comments. So that leaves it to the engineers, who need to do it after the fact. Some teammates do it rigorously, whereas some don't. I think management's challenge is also that they can only push so much, as it becomes an attrition risk :(
Yea, that’s not uncommon. Personally I prefer to give each document a specific owner, but either way you do it someone has to be tasked with ensuring that the documentation is correct.
It sounds like your team lacks a culture of continuous improvement - IMO in a product team on-call's full-time job is to make the next on-call engineer's job easier through deleting irrelevant alerts, automating fixes, and generally making the system more stable.

I wrote a longer guide about this here: https://onlineornot.com/incident-management/on-call/improvin...

Yeah, I must agree it is a cultural issue to some extent. But honestly, on-call at my current company is quite demanding. So during the on-call week, though engineers try to improve things, they always run out of time or miss a few things, which then puts a burden on future on-call.

I think there should be a nice lightweight tool that gives a clear summary and a tracking mechanism to make this a quicker task. Even just to tag the runbooks that have not been updated. All those notes get lost in the documentation and are never referred back to.
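As a rough idea of how small such a tool could be, here is a sketch that flags runbooks whose last review is older than a threshold. It assumes each runbook has a recorded last-reviewed date; the runbook names, the dates, the 90-day default, and the `find_stale_runbooks` helper are all made up for illustration.

```python
from datetime import date, timedelta


def find_stale_runbooks(last_reviewed, today, max_age_days=90):
    """Return names of runbooks not reviewed within max_age_days, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    # Keep (date, name) pairs so sorting puts the oldest review first.
    stale = [(reviewed, name) for name, reviewed in last_reviewed.items()
             if reviewed < cutoff]
    return [name for _, name in sorted(stale)]


# Hypothetical last-reviewed dates, e.g. exported from a wiki or a repo.
runbooks = {
    "db-failover": date(2023, 1, 10),
    "cache-flush": date(2023, 5, 2),
    "disk-full": date(2022, 11, 30),
}

stale = find_stale_runbooks(runbooks, today=date(2023, 6, 1))
print(stale)  # ['disk-full', 'db-failover']
```

The output could be posted to the team channel or attached to the hand-off notes, so the stale list is seen without anyone having to go looking for it.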

In previous teams, we just used a JIRA backlog to manage these tasks.
Yeah, JIRA could be handy and useful, though you need to create a ticket for every task and rigorously track them alongside other backlog and story items.