

by rfreiberger 1991 days ago

I'm not in development or engineering directly, but I work in an operations role (mostly Puppet and Terraform work) where I've been oncall for the majority of my career. One thing that is common when things get really bad is the mindset that "oncall doesn't count towards my role at the company." Many people see it as the painful work of cleaning up while the others are out building the next big thing. So it's easy to see people jump into the shift, deal with the mass of alerts, then leave without making any improvements for the next guy or gal.

One way we have been trying to improve this is working with PagerDuty reporting and looking, with the team, at the total number of interruptions (not just pages, but any time PagerDuty reminds you about an alert, an expired snooze, or an escalation). It's very easy to forget the oncall shift as you leave it, but having more eyes on the shifts starts to bring awareness and lots of "why is that still broken" questions that are better answered at 10am than at 3am on a Sunday. I came from a large Operations Center so I know the pain of bad alerts, mostly CYA stuff that was put in place just to make sure the last guy can't get blamed. Sort of like adding hundreds of random smoke detectors to a building without any fire suppression. The intention is good but the results are poor.
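For anyone who wants to try the same kind of tally, here is a minimal sketch of counting interruptions per shift. It assumes you already have the events as simple records; the field names (`shift`, `kind`) and the sample data are made up for illustration and are not PagerDuty's actual export format or API.

```python
from collections import Counter

# Hypothetical event records, e.g. copied by hand from an on-call report.
# The "shift" and "kind" fields are illustrative assumptions, not a real schema.
EVENTS = [
    {"shift": "week-1", "kind": "page"},
    {"shift": "week-1", "kind": "snooze_expired"},
    {"shift": "week-1", "kind": "escalation"},
    {"shift": "week-2", "kind": "page"},
    {"shift": "week-2", "kind": "reminder"},
]


def interruptions_per_shift(events):
    """Count every interruption per shift, regardless of its kind."""
    return Counter(event["shift"] for event in events)


def pages_per_shift(events):
    """Count only the initial pages per shift, for comparison."""
    return Counter(event["shift"] for event in events if event["kind"] == "page")


if __name__ == "__main__":
    totals = interruptions_per_shift(EVENTS)
    pages = pages_per_shift(EVENTS)
    for shift in sorted(totals):
        print(f"{shift}: {totals[shift]} interruptions, {pages[shift]} pages")
```

Putting the two numbers side by side is the point: a shift can look quiet by page count and still have been interrupted constantly by reminders and escalations.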

Outside of the meeting with the team, we also have proper handoff meetings between the person going off call and the person coming on call, so they can share what's going on verbally instead of just tagging the next person with the alerts. That makes it easier to pass along what's going on, any weird problems, and notes. Also, we're not using 24/7 oncall coverage but 12/5 and 48/2 for the weekends. It's a small change but it helps so much. The worst I ran was a 24/7 rotation at a major email company, where I was paged every three hours for the entire week. After that I knew the team didn't want to change and I needed to do something about it.