by AdieuToLogic 631 days ago
> ... I oversee a team handling a moderate volume of on-call issues (typically 4-5 per week). In addition to managing production incidents, our on-call responsibilities extend to monitoring application and infrastructure alerts.

Being on-call while also being responsible for asynchronous alert response is its own distinct job, especially when considering:

> Often, the on-call engineers are pulled into working on production features or long-term fixes from previous issues, leaving little bandwidth for proactive system improvements.

The framework you seek could be:

- hire and train enough support personnel to perform requisite monitoring

- take your development engineers out of the on-call rotation

- treat operations concerns the same as production features, prioritizing accordingly

The last point is key. Any system change, whether a functional enhancement, operations-related work, or something else, can be approached with the same vigor and professionalism.

It is just a matter of commitment.