Hacker News
by zxcvbn4038 2456 days ago
My advice is to use your time on call to your advantage. Don’t address just the symptoms: when you receive a call, try to understand the root cause and take steps to prevent that situation from happening again. For example, if you are paged for low disk space, make sure log rotation is present, working, and aggressive enough to stay ahead of the generation rate. Have the thing that checks the disk space perform the most common remediation steps and then page only if they are unsuccessful. If you’re in the cloud, then just kill anything that runs out of disk space; it’s the application owner’s responsibility to arrange for long-term storage, etc. Do this for every call you receive and soon your phone will be silent.
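
A rough sketch of the "remediate first, page only if that fails" idea, in Python. Everything here is an assumption for illustration: the names `check_disk`, `free_fraction`, `remediations` and `page` are made up, and the remediation steps and paging hook are whatever your own tooling provides (forcing a log rotation, clearing a temp directory, calling your pager).

```python
import shutil
from typing import Callable, Iterable


def free_fraction(path: str) -> float:
    """Return the fraction of the filesystem at `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def check_disk(
    path: str,
    min_free: float,
    remediations: Iterable[Callable[[], None]],
    page: Callable[[str], None],
    free_fn: Callable[[str], float] = free_fraction,
) -> str:
    """Try each remediation in order; page a human only if none of them helps.

    `remediations` and `page` are hypothetical hooks supplied by the caller.
    Returns "ok", "remediated" or "paged".
    """
    if free_fn(path) >= min_free:
        return "ok"

    for step in remediations:
        step()  # e.g. force log rotation, delete old temp files
        if free_fn(path) >= min_free:
            return "remediated"

    # Every automatic step ran and the disk is still too full.
    page(f"Low disk space on {path} after automatic remediation")
    return "paged"
```

Passing `free_fn` in makes it easy to exercise the logic without actually filling a disk; in real use you would leave the default and run this from whatever already does the disk check.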

My employer makes use of PagerDuty and I’ve spent a lot of time setting up “auto-resolve” of alerts. I even hook into AWS autoscaling lifecycle events and send mock “OK” actions when something that had thrown an alarm gets terminated. I still get paged, but most issues solve themselves if I wait one more monitoring interval.
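
To give an idea of what the lifecycle hook part can look like, here is a minimal sketch in Python. It is not my actual setup, and it makes some assumptions: that the termination lifecycle event arrives as an EventBridge-style dict with the instance ID under `detail.EC2InstanceId`, that the original alert was triggered through the PagerDuty Events API v2, and that the alert used a dedup key of the form `instance/<id>` (that naming convention is made up for the example). The function names are hypothetical too.

```python
import json
import urllib.request
from typing import Optional

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
TERMINATE_DETAIL_TYPE = "EC2 Instance-terminate Lifecycle Action"


def resolve_payload_for_termination(event: dict, routing_key: str) -> Optional[dict]:
    """Build a PagerDuty "resolve" event for a terminated instance.

    Returns None if the event is not a termination lifecycle action or
    carries no instance ID.
    """
    if event.get("detail-type") != TERMINATE_DETAIL_TYPE:
        return None
    instance_id = event.get("detail", {}).get("EC2InstanceId")
    if not instance_id:
        return None
    return {
        "routing_key": routing_key,
        "event_action": "resolve",
        # Assumed convention: the alert was triggered with this same key.
        "dedup_key": f"instance/{instance_id}",
    }


def send_event(payload: dict) -> int:
    """POST the payload to the Events API and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

The resolve only closes an incident if the dedup key matches the one used when the alert fired, so the key scheme has to be agreed on by whatever raises the alarm in the first place.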

I’ve also used being on call as an excuse to leave early, to ensure I’m home and able to respond to calls when everyone else leaves the office. There’s not much I can do if I’m stuck in traffic, or in a tunnel, etc.

1 comment

Thanks - we try to tune our alerts, and we have a lot that are self-healing as well. The ones I've been mentioning are the ones we currently don't have automated solutions for, so I have to action them manually. Our management team is working on automating away the work, but the technical debt is going to take longer to fix. We get some flexibility to leave early / start late when alerts affect our shift as well, although it's not worth the cost to me personally.