by t-writescode 1993 days ago

I have worked for 3 separate companies in my decade of software development where I have had to be on-call. One of these companies was an organization at a very large and prominent software company.

The different on-call rotations worked out thusly:

1. On-call was 1 month long. Response times had to be very short. During business hours, there was a large queue of long-tail work to resolve, on top of my normal work. Most of the employees here were probably in their 20s and 30s.

2. Small company. Probably 30 devs total. I was on a team of 1, 2 and eventually 3 people. On-call was 24/7 for my team. Response time was about an hour. I was the youngest employee and most employees here were in their 40s or beyond.

3. Smallish company. < 500 employees. Dev team size of 6ish. On-call is a week-long venture. Turnaround time is very short, I think 30 minutes? On-call is a dedicated period. Most issues can be resolved during business hours, but emergencies are handled at all times.

For [2] and [3], there were unwritten patterns around how much you really needed to be at work once your shift was over if on-call was particularly bad.

At [1], the on-call was particularly long and harsh for a couple of reasons. In the early days, I heard, on-call was absolutely horrible: logs were non-existent, and errors were terrible and required a great deal of work. But it made developers feel the pain of not logging properly, not handling errors correctly, and not monitoring usefully. Over time those issues were resolved, and the team now has incredible logging and incredible tooling, because they know they're going to be the ones who have to fix it next time.

At [2], the constant trouble caused by code written before my time there pushed the developers of that old code to make it more stable. The services eventually became auto-resolving, and we had a network operations center (with appropriate work hours that covered the whole day) that had playbooks for all the remaining normal issues, so only the bad stuff made it to us. By the end of my tenure there, being on-call 24/7 meant I might get called once every couple of weeks or less. I lived a normal life.

At [3], we're still learning and the code is in constant churn. Issues come up and we attempt to fix the root cause of most of them. Our logging and monitoring have been gradually improving, and both are being tweaked to find real issues.

My thoughts:

I think on-call is an important experience for developers. Developers should be first responders for their code when it hits production for the first day or two to catch any possible issue.

Developers should know the pain of deploying their change at noon, or on a Friday at 5pm, or at 11pm on a Wednesday, so that they accept responsibility for it, and understand what is at stake, if it breaks at those times. Those actions should be above and beyond their on-call rotation.

If the work of the on-call is especially intense, it should be a separate role that the developers take in rotation, so that on-call is the only thing that developer works on during their turn.

Developers should write code and review code with debugging and tracing and monitoring and self-correction in mind, to reduce on-call pain - and one of the best ways to do that is to make them feel it, themselves.

If your code-base is having as many issues as you suggest, the code probably has some common problem areas and pitfalls, and maybe there'll be patterns the team can implement each time those same issues come up. As a result, those errors won't come up as frequently.

If the monitors are too noisy with non-errors, then a couple of things could be going on. Let's say that the code 500s when someone passes an invalid argument, or a record isn't found. Those probably shouldn't be 500s, so the code needs to be updated so that they aren't. On the other hand, if there's a monitor checking for more than 5 401s in a minute, maybe that's a bit strict and should be changed to "more than 10 401s a minute, every minute for 10 minutes; OR more than 200 401s a minute" - that way you catch the big ugly case of "our auth service is down" and don't get paged because a few people failed to enter their password a bunch of times (and then gave up).
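To make those two adjustments concrete, here's a rough Python sketch. Everything in it is made up for illustration: the exception names, the function names, and the idea that the monitor gets a list of per-minute 401 counts are all assumptions, not any particular monitoring tool's API. The thresholds are just the ones from the example above.

```python
from typing import Sequence


# Hypothetical exception types; stand-ins for whatever the real code raises.
class InvalidArgument(Exception):
    pass


class RecordNotFound(Exception):
    pass


def status_for(error: Exception) -> int:
    """Map an error to an HTTP status so caller mistakes don't look like 500s."""
    if isinstance(error, InvalidArgument):
        return 400  # the caller sent something bad
    if isinstance(error, RecordNotFound):
        return 404  # nothing is broken, the record just isn't there
    return 500  # anything else is a real server-side failure


# Thresholds taken from the example rule in the text.
SUSTAINED_PER_MINUTE = 10  # "more than 10 401s a minute..."
SUSTAINED_MINUTES = 10     # "...every minute for 10 minutes"
SPIKE_PER_MINUTE = 200     # "OR more than 200 401s a minute"


def should_page(counts_401: Sequence[int]) -> bool:
    """Decide whether to page, given one 401 count per minute, oldest first."""
    if not counts_401:
        return False
    # Spike rule: the most recent minute alone is bad enough.
    if counts_401[-1] > SPIKE_PER_MINUTE:
        return True
    # Sustained rule: every one of the last 10 minutes is over the threshold.
    window = counts_401[-SUSTAINED_MINUTES:]
    return len(window) == SUSTAINED_MINUTES and all(
        count > SUSTAINED_PER_MINUTE for count in window
    )
```

With this, a steady 6 failed logins a minute (which would trip the "more than 5 in a minute" monitor) never pages, while a single minute with 201 of them does.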

If the code is an absolute and unfixable mess and you don't want to help fix it, if management is not interested in improving common pitfalls, then maybe it's time for you to look for another job.

Here's some additional reading: https://sre.google/sre-book/being-on-call/