by Niksko 2456 days ago
I'm part of a team that operates a roughly 100-node Kubernetes cluster. I'm on call after hours for a week at a time, roughly every six weeks. I think I've been on call for three weeks this year, and I've been paged twice. Both of those were pretty straightforward problems solved within half an hour or so, with zero customer impact. This is roughly what other people in my team experience, probably averaging less than one page per on-call rotation.

The question you should be asking is: why am I being paged so often?

Are they legitimate things that you need to respond to? If so, you should be fixing these issues so that they don't happen again. If anyone gets a page, we make it a high priority to fix whatever caused it. We are a team of 7, and we dedicate one person a week to field questions relating to our platform as well as to fix up these issues that wake us up.

If they're not legitimate things that you need to be woken up for, why are you being woken up? If this is the case, you need to make sure everyone is on the same page regarding what constitutes something you need to be paged for after hours.
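
One way to make that agreement concrete is to write the paging rule down as code. This is a minimal sketch in Python, not a description of any real setup: the `Alert` fields, the `should_page` function, and the 09:00 to 17:00 business-hours window are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Alert:
    name: str
    customer_impact: bool  # is a customer affected right now? (hypothetical field)
    actionable: bool       # can the on-call person actually do something? (hypothetical field)


def should_page(alert: Alert, at: time) -> bool:
    """Return True if this alert should page the on-call person at time `at`."""
    # Assumed business-hours window; adjust to whatever the team agrees on.
    business_hours = time(9, 0) <= at < time(17, 0)
    if business_hours:
        # During the day, anything actionable can go to whoever is on duty.
        return alert.actionable
    # After hours, only wake someone for actionable problems that hurt customers.
    return alert.actionable and alert.customer_impact
```

With a rule like this, an actionable but harmless alert at 03:00 would wait until morning, while the same alert with customer impact would page.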

1 comment

Thanks for the reply - I appreciate the insight and follow up questions.

> The question you should be asking is: why am I being paged so often? Are they legitimate things that you need to respond to? If so, you should be fixing these issues so that they don't happen again.

This is mostly due to not having anyone else around to handle customer issues (which currently require manual intervention); however, system issues are also pretty frequent here. Management is working on prioritizing the automation of the customer issues so that there are fewer of them in total, but system issues will likely be harder to resolve (we try to resolve them as they come up if possible, but many stem from systemic technical debt).

So yes - I'm only including the events that are actionable and require breaking out the laptop - these generally vary from 15 minutes to 3 hours of support.