by mduggles 2457 days ago
I mean, it depends on whether you are doing anything with the pages and whether they’re followed up on. As someone who has been on various on-call rotations for a decade, I would describe that as a pretty heavy paging load for an average rotation.

My key criteria for paging are:

1. Was the page actionable? Did I need to do something to restore the system to functioning or prevent it from going down?

2. Can I prevent this page in the future and, most importantly, am I empowered by leadership to do that? If your app is paging me because it’s poorly made and I am not authorized to change it, that’s a leadership problem, and an extremely common one.

3. Are we auditing the pages? Often alerts in technology are designed in response to a particular problem and then never removed. Paging is, to me, a very serious action for a system to take. It means it is impossible for the system to naturally recover and all automation has failed. So every time we page someone we should as a team review those pages to ensure they’re actionable and actually impossible to naturally recover from.
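As a rough sketch of what that kind of review could look like in code: the `Page` record, its fields, the alert names, and the 50% threshold are all assumptions made up for illustration, not anything taken from a real paging tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    alert: str            # hypothetical alert name
    actionable: bool      # did the responder have to do something?
    self_recovered: bool  # did the system recover without intervention?


def audit(pages, min_justified_ratio=0.5):
    """Return alert names whose pages mostly fail the criteria above."""
    by_alert = defaultdict(list)
    for page in pages:
        by_alert[page.alert].append(page)

    candidates = []
    for alert, group in sorted(by_alert.items()):
        # A page is "justified" only if it was actionable and
        # the system could not have recovered on its own.
        justified = sum(1 for p in group if p.actionable and not p.self_recovered)
        if justified / len(group) < min_justified_ratio:
            candidates.append(alert)
    return candidates


# Made-up sample data for one review period.
pages = [
    Page("disk_full", actionable=True, self_recovered=False),
    Page("cpu_spike", actionable=False, self_recovered=True),
    Page("cpu_spike", actionable=False, self_recovered=True),
    Page("cpu_spike", actionable=True, self_recovered=False),
]
print(audit(pages))  # alerts worth downgrading or removing
```

Anything the function returns is a candidate for turning off or demoting to a non-paging notification; the team still has to make the call.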

These criteria have served me well for years and led me to turn off the vast majority of the alerts for my services.

But you seem to have a culture that accepts this as normal and tbh these rarely change. Just know that it isn’t normal and it’s not acceptable.


Thanks - of the 5-7 pages per week I mentioned, all are items that require manual action from me. Lots are after-hours customer support issues that require administrator-level access; others are systems issues tied to technical debt, or legitimate problems that occur.

There is effort to try and resolve the underlying problems, and we do make some headway here - we just keep adding changes to satisfy customers which end up causing new issues. We're being told this will get better over time, but it's certainly not happening fast enough IMHO.

Again, thanks for the feedback and insight!

That comes back to the parent's comment about being empowered to fix the issues. The person on call should have power to prevent such calls in the future. This is important for the health of the individual and of the company.

Are the people in charge of fixing the underlying issues themselves on call? How about the people producing the changes that cause new issues?

If those two groups aren't themselves being woken up when there's a problem, you can reasonably expect that this won't change until the support calls start to directly affect the company's bottom line.

> Are the people in charge of fixing the underlying issues themselves on call?

Yes, although we're on call frequently and tasked with other priorities when we're not, so progress is slow. I mentioned in another comment as well that the executive focus is to do pretty much whatever our customers want, so this generally results in lots of new problems by the time we fix older ones.

> How about the people producing the changes that cause new issues?

They are responsible for fixing the code, but they can mostly do so during regular 9-5 hours, so they don't feel the same level of pain. I realise this is a problem; thanks for raising it.