by zxcvbn4038
2456 days ago
My advice is to use your time on call to your advantage. Don’t address just the symptoms: when you receive a call, try to understand the root cause and take steps to prevent that situation from happening again. For example, if you are paged for low disk space, make sure log rotation is present, working, and aggressive enough to stay ahead of the generation rate. Have the thing that checks the disk space perform the most common remediation steps and then page only if they are unsuccessful. If you’re in the cloud, then just kill anything that runs out of disk space; it’s the application owner’s responsibility to arrange for long-term storage, etc. Do this for every call you receive and soon your phone will be silent. My employer makes use of PagerDuty and I’ve spent a lot of time setting up “auto-resolve” of alerts. I even hook into AWS autoscaling lifecycle events and send mock “OK” actions when something that had thrown an alarm gets terminated. I still get paged, but most issues solve themselves if I wait one more monitoring interval. I’ve also used being on call as an excuse to leave early, to ensure I’m home and able to respond to calls when everyone else leaves the office; there’s not much I can do if I’m stuck in traffic, or in a tunnel, etc.
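
To make the "remediate first, page only if unsuccessful" idea concrete, here is a rough Python sketch. It is not my actual setup: the function names, the threshold, and the `remediations` / `page` callables are all made up for illustration, and you would plug in your own log cleanup and whatever call your paging system needs.

```python
import shutil
from typing import Callable, Iterable


def disk_used_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def check_and_remediate(
    path: str,
    threshold: float,
    remediations: Iterable[Callable[[], None]],
    page: Callable[[str], None],
    measure: Callable[[str], float] = disk_used_fraction,
) -> bool:
    """Return True if the disk is healthy, possibly after remediation.

    `remediations` and `page` are hypothetical hooks: each remediation is a
    cleanup step (force log rotation, purge temp files, ...) and `page` is
    whatever notifies the on-call person. `page` is called only when every
    remediation has run and usage is still at or above `threshold`.
    """
    if measure(path) < threshold:
        return True  # nothing to do, nobody gets woken up

    for step in remediations:
        step()
        if measure(path) < threshold:
            return True  # fixed automatically, no page sent

    page(f"{path} still above {threshold:.0%} after automatic remediation")
    return False
```

You would run something like `check_and_remediate("/var/log", 0.90, [force_logrotate], send_page)` from the monitoring check itself, where `force_logrotate` and `send_page` are placeholders for your own functions. Order the remediations from cheapest and safest to most drastic, since the loop stops as soon as usage drops back under the threshold.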