Hacker News
by windows2020 631 days ago
1) Identify on-call issues that aren't engineering issues or for which there's a workaround. Maybe institutional knowledge needs to be aggregated and shared.

2) Automate application monitoring by alerting at thresholds. Tweak alerts until they're correct and resolve items that trigger false positives.

3) If issues are coming from a system whose designer is still at the company, that person should handle those calls.

4) You mention long-term fixes for on-call issues. First focus on short-term fixes.

5) Set a new expectation that on-call issues are unexpected exceptions. If they occur, the root cause should be resolved. But see point 4.

6) Eventually, on-call issues become so rare that all you need is an ordered list of people to call in the event of an issue. The team informally ensures someone is always available. But if something happens, everyone else who's available is happy to jump on a call to help understand what's going on and, if conditions permit, permanently resolve it the next business day.
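
To make point 2 a bit more concrete, here is a rough sketch of threshold alerting that only fires after several consecutive breaches, which is one common way to cut down on false positives. The class name, the metric name, and the numbers are all made up for illustration; a real setup would use whatever your monitoring tool provides.

```python
from dataclasses import dataclass, field


@dataclass
class ThresholdAlert:
    """Hypothetical alert: fires once `required_breaches` consecutive
    samples are above `threshold`. A single spike does not page anyone."""

    name: str
    threshold: float
    required_breaches: int = 3
    _streak: int = field(default=0, init=False)

    def observe(self, value: float) -> bool:
        # Count consecutive samples above the threshold; reset on a good one.
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0
        return self._streak >= self.required_breaches


# Assumed example: error rate sampled once a minute, alert above 5%.
alert = ThresholdAlert("error_rate", threshold=0.05, required_breaches=3)
samples = [0.01, 0.09, 0.02, 0.06, 0.07, 0.08, 0.01]
fired = [alert.observe(s) for s in samples]
print(fired)
```

With these samples the lone spike at 0.09 is ignored, and the alert only fires on the third sample of the sustained run (0.06, 0.07, 0.08). Tuning `threshold` and `required_breaches` is the "tweak alerts until they're correct" part.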