by theideaofcoffee 638 days ago
Alert fatigue. Alert fatigue. Alert fatigue. Cutting it down is the single biggest quality-of-life thing you can do to help with the annoyance that is on call. If you know you're in store for the same alert again and again, or even just know you're going to get paged, it's hard to think about anything else. It then becomes a game of normalizing deviance and burnout: "oh, we just ignored that one last time". OK, then why are they alerts if they can be ignored? It's just going to murder people's spirit after a while.

Someone gets called in the middle of the night? Let them take the morning to recover, no questions asked. Better yet, the entire day if it was a particularly hairy issue. This is where your mettle as a manager is really tested against your higher-ups. If your people are putting in unscheduled time, you'd better be ready to cough up something in return.

Figure out what's commonly coming up and root cause those issues so they can finally be put to bed (and your on-call can go back to bed, hah).

Everyone that touches a system gets put on call for that same system. That creates an incentive to make it resilient so they don't have to be roused and so there's less us-vs-them and throwing issues over the wall.

Beyond that, if someone is on call, that's all they should be doing. No deep feature work. They should be focusing on alerts: what's causing them, how to minimize them, triaging and then retro-ing so the list is always being pared down.

Lean on your alerting system to tell you the big things: when, why, how often, all that. The idea is that you should understand exactly what is happening and why; you can't do much to fix anything if you don't know the why.
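
For what it's worth, here's a rough sketch of the kind of summary I mean, in Python. Everything in it is made up for illustration: `ALERT_LOG`, the alert names, the `acted_on` flag and the night-hours window are all assumptions, and you'd swap in whatever export your own alerting system actually gives you. It counts how often each alert fired, how many times nobody acted on it, and how many times it went off at night.

```python
from collections import Counter
from datetime import datetime

# Hypothetical export of paging history (not from any real system):
# (alert name, ISO timestamp, whether anyone actually acted on it).
ALERT_LOG = [
    ("disk_usage_high", "2024-03-01T03:12:00", False),
    ("disk_usage_high", "2024-03-02T03:15:00", False),
    ("api_latency_p99", "2024-03-02T14:40:00", True),
    ("disk_usage_high", "2024-03-03T03:11:00", False),
    ("db_replica_lag", "2024-03-04T23:05:00", True),
]


def summarize(log, night_start=22, night_end=7):
    """Per-alert totals: how often, how often ignored, how often at night.

    The night window (22:00 to 07:00 by default) is an assumption; adjust
    it to your team's actual working hours.
    """
    total = Counter()
    ignored = Counter()
    at_night = Counter()
    for name, timestamp, acted_on in log:
        hour = datetime.fromisoformat(timestamp).hour
        total[name] += 1
        if not acted_on:
            ignored[name] += 1
        if hour >= night_start or hour < night_end:
            at_night[name] += 1
    # Noisiest alerts first, so the top of the list is where to dig in.
    return [
        {
            "alert": name,
            "count": count,
            "ignored": ignored[name],
            "night": at_night[name],
        }
        for name, count in total.most_common()
    ]


if __name__ == "__main__":
    for row in summarize(ALERT_LOG):
        print(row)
```

An alert that sits at the top of that list with `ignored` equal to `count` is a good candidate for either a real fix or deletion.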

Look at your documentation. Can someone who is less than familiar with a given system easily start to debug things, or do they need to learn the entire thing before they can start fixing? Make sure your documentation is up to date and write runbooks for common issues (better yet, automate the fixes; computers are good at logic like that!). Give enough context that being bleary-eyed at 3:30am isn't that much of a hindrance. Minimize the chances of having to call in a system's expert to help debug. Everyone should be contributing there (see my fourth point above, about putting everyone who touches a system on call for it).
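
A minimal sketch of the runbook-plus-automation idea, again in Python and again entirely hypothetical: the `Runbook` class, the `RUNBOOKS` table, the alert names and the `rotate_logs` placeholder are invented for the example, not taken from any real tool. The point is just that each alert name maps to either plain steps a tired human can follow or a function that does the fix for them.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Runbook:
    summary: str
    steps: list = field(default_factory=list)
    # Optional automated remediation; None means a human follows the steps.
    auto_fix: Optional[Callable[[], str]] = None


def rotate_logs() -> str:
    # Placeholder: a real version would do the actual cleanup.
    return "rotated logs (placeholder)"


# Hypothetical alert names and steps, for illustration only.
RUNBOOKS = {
    "disk_usage_high": Runbook(
        summary="Log volume is filling up.",
        steps=["Check which directory grew", "Rotate or compress old logs"],
        auto_fix=rotate_logs,
    ),
    "db_replica_lag": Runbook(
        summary="Replica is behind the primary.",
        steps=["Check for long-running queries", "Check replica disk and network"],
    ),
}


def handle(alert_name: str) -> str:
    """Return what the on-call person should see for this alert."""
    runbook = RUNBOOKS.get(alert_name)
    if runbook is None:
        # A missing runbook is itself a finding for the next retro.
        return f"No runbook for {alert_name!r}: escalate, then write one."
    if runbook.auto_fix is not None:
        return f"auto-fixed: {runbook.auto_fix()}"
    numbered = [f"{i}. {step}" for i, step in enumerate(runbook.steps, start=1)]
    return "\n".join([runbook.summary, *numbered])


if __name__ == "__main__":
    for name in ("disk_usage_high", "db_replica_lag", "cert_expiring"):
        print(handle(name))
        print()
```

Keeping the table in the same repo as the system makes it more likely that whoever changes the system also updates the runbook.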

Make sure you are keeping an eye on workload too. You may need to think about increasing the number of people on your team if actual feature work isn't getting done because you're busy fighting fires.