by mduggles
2457 days ago
I mean, it depends on whether you are doing anything with the pages and whether they’re followed up on. As someone who has been on various on-call rotations for a decade, I would describe that as a pretty heavy paging load for an average rotation. The key criteria for me and paging are:
1. Was the page actionable? Did I need to do something to restore the system to functioning or prevent it from going down?
2. Can I prevent this page in the future and, most importantly, am I empowered by leadership to do that? If your app is paging me because it’s poorly made and I am not authorized to change it, that’s a leadership problem, and an extremely common one.
3. Are we auditing the pages? Often alerts in technology are designed in response to a particular problem and then never removed.
Paging is, to me, a very serious action for a system to take. It means it is impossible for the system to naturally recover and all automation has failed. So every time we page someone, we should as a team review those pages to ensure they’re actionable and actually impossible to naturally recover from.
These criteria have served me well for years and caused me to turn off the vast majority of the alerts on my services. But you seem to have a culture that accepts this as normal, and tbh these rarely change. Just know that it isn’t normal and it’s not acceptable.
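For what it's worth, the review step could be as simple as something like the sketch below. This is only an illustration, not tied to any real paging tool: the `Page` record, its fields, and the `audit` helper are all hypothetical names, and it assumes someone on the rotation marks each page as actionable or not after the fact.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical record filled in by the responder after each page.
    alert: str            # name of the alert that fired
    actionable: bool      # did the responder have to do something?
    self_recovered: bool  # did the system recover without intervention?


def audit(pages):
    """Count pages per alert that were not actionable or that self-recovered.

    Returns (alert, count) pairs, noisiest alert first. These are the
    alerts the team should consider fixing, demoting, or removing.
    """
    noisy = Counter(
        p.alert for p in pages if not p.actionable or p.self_recovered
    )
    return noisy.most_common()


# Made-up example data for one review period.
pages = [
    Page("disk_full", actionable=True, self_recovered=False),
    Page("cpu_spike", actionable=False, self_recovered=True),
    Page("cpu_spike", actionable=False, self_recovered=True),
    Page("queue_lag", actionable=False, self_recovered=False),
]

for alert, count in audit(pages):
    print(f"{alert}: {count} page(s) to review")
```

Anything that keeps showing up in that list is a candidate for being turned off or downgraded from a page to a ticket.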
There is effort to try and resolve the underlying problems, and we do make some headway here - we just keep adding changes to satisfy customers which end up causing new issues. We're being told this will get better over time, but it's certainly not happening fast enough IMHO.
Again, thanks for the feedback and insight!