by TimeWeSp
1082 days ago
I've experienced all of these problems. There are tools that try to address them, e.g. https://incident.io/ (which I'm not affiliated with in any way), but it's not easy. I think they all stem from the same root cause: teams not investing enough in making oncall processes and tooling good, and in particular not keeping things up to date. As you say, runbooks are often outdated. The same happens with lists mapping component ownership to teams. There's another problem (#8 to add to the list) that I've also felt the pain of: how you get scheduled for oncall. We had ad-hoc, manual scheduling of who would be oncall when. A tool for solving that is https://oncallscheduler.com (which I am affiliated with). It automates oncall scheduling, makes it fair and predictable, and gives all engineers self-service control over when and how they're scheduled. I'd love some feedback on it.