Ask HN: Any luck negotiating better terms for on-call?
101 points by ymnska 1991 days ago
(Throwaway account.) I work for a large software company on an on-call rotation that’s been getting more toilsome, and wondering if anyone has been in a similar place.

Like many SV companies, on-call isn’t compensated with the rationale that it’s part of your engineering duties. I buy this to some degree—someone does have to be keeping an eye on things—but it's complicated by sizable inequities across the org. _Most_ people have no on-call rotation, many others have a token rotation that’s ~never used, and only a handful of teams have rotations that are quite bad. Management has extricated themselves completely.

Things have been angling slowly worse. In a gambit to prioritize uptime over engineer time, we have more alarms, tighter tolerances, and a larger operation that generates more tail problems. Good for users, but not so good for us. Being able to sleep fully through the night is increasingly rare. There are some false positives, but most are not, and not easily fixed by more engineering.

Expected time to respond has dropped to low single digits; theoretically, you should not be exercising or driving if you’re on. The scheme works because many engineers are in their 20s and willing to soak up pain like a sponge. Rotations tend to get smaller over time as individuals make backroom deals to get out, and new blood is added too slowly.

I’m not trying to get myself out, but want to effect some kind of change. IMO compensation or extra time off would be ideal: not only is it a nod to the cost of on-call, but it also makes exchanging shifts easier by adding incentive beyond simple goodwill. The company could easily afford it, but probably doesn’t want to pay for what it can get for free.

I have frequent conversations with my manager and get token “yeah, we’re looking into it”s, but it’s obviously not a priority for anyone up the chain. Has anyone else been in a similar position? Are you paid? What did you do? Suck it up? Leave?

49 comments

Former SV engineering manager and engineering director here:

Have you had a conversation with your skip-level manager? If so, then you are probably right that it's not valued up the chain and you should leave because that is a total shit show that is not the norm.

If you haven't, reach out for time on their calendar, write down your data points on on-call wake-up rates and total alarms over time, and let the data make the point that this is not sustainable.
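
One way to assemble those data points, as a minimal sketch: it assumes you can export page timestamps from whatever paging tool you use. The sample list below is made up, and the 10 PM to 7 AM "night" window is an assumption of this sketch, not something from the thread. It counts pages per ISO week and how many of them landed overnight.

```python
from collections import Counter
from datetime import datetime

# Hypothetical export of page timestamps (local time); replace with real data.
PAGES = [
    "2020-03-02 02:14", "2020-03-02 14:30", "2020-03-04 23:05",
    "2020-03-10 03:40", "2020-03-11 04:10", "2020-03-12 11:00",
]

NIGHT_START, NIGHT_END = 22, 7  # assumed "sleep hours": 10 PM to 7 AM


def is_night(ts: datetime) -> bool:
    """True if the page arrived inside the assumed night window."""
    return ts.hour >= NIGHT_START or ts.hour < NIGHT_END


def weekly_summary(pages):
    """Return {(iso_year, iso_week): (total_pages, night_pages)}."""
    total, night = Counter(), Counter()
    for raw in pages:
        ts = datetime.strptime(raw, "%Y-%m-%d %H:%M")
        iso = ts.isocalendar()
        week = (iso[0], iso[1])
        total[week] += 1
        if is_night(ts):
            night[week] += 1
    return {week: (total[week], night[week]) for week in sorted(total)}


if __name__ == "__main__":
    for (year, week), (n_total, n_night) in weekly_summary(PAGES).items():
        print(f"{year}-W{week:02d}: {n_total} pages, {n_night} overnight")
```

A table like this, covering a few months, is usually easier for a skip-level to act on than anecdotes.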

The Director should have some options. How big is the rotation? Is the manager in the rotation themselves? When you're on call are you also expected to contribute story points to the sprint? Why are you not able to solve underlying engineering issues that are causing the SLO violations?

If you came to me, I would be shocked, and immediately make a plan with the engineering manager. Any time a person is woken up by an alarm, it's an incident. There needs to be a response to every incident. There needs to be some serious bar-raising and you can't do it yourself. You need an ally in your management chain and if you don't have one, you're better off transferring teams or companies.

At some point, OP and their coworkers should be refusing to work on any new code that isn’t at least superficially in the service of reducing these preemption events. If coworkers aren’t concerned about that, it’s probably best to move on.
Any manager who disregards potential churn and/or mass walkouts is probably too obtuse to be talked sense to.
At big companies, it should be possible to have enough engineering offices to staff oncall with normal 8 hour days. (8 hours in Tokyo, 8 hours in London, 8 hours in San Francisco, or similar.)

At startups, it's harder; you simply can't have 3 dev teams on 3 continents, so someone is going to have to be around at night. The balance we found at the company I work at is that you are oncall Thursday-Thursday and get Friday off. (Not "not oncall", but "don't show up".) This seems fair to me; the free vacation day is really nice! (We're hiring! https://pachyderm.io/careers/)

When I was at Google, I worked on Fiber and we didn't have a dev presence around the world, so we had to be oncall after hours. We had a dedicated operations team with people paid to be at work during strange hours, as you'd expect from an ISP, but some issues were escalated to the dev team, and we had to be around for those. I was also the TL for a monitoring system that informed operations of outages, so my team would need to be around to handle monitoring monitoring ;) We just got paid extra for every hour we were oncall, I remember it being something like $1600 per week, but I forget the exact number. I was happy with this arrangement. Other people weren't, and weren't asked to be oncall, and it didn't count against them in any way. It all seemed fair to me.

> At big companies, it should be possible to have enough engineering offices to staff oncall with normal 8 hour days. (8 hours in Tokyo, 8 hours in London, 8 hours in San Francisco, or similar.)

That arrangement is commonly known as "follow the sun support" (not particularly at you, it's just a good piece of jargon to know).

It comes with its own set of issues. I work on a team that does follow the sun support, and while it's great for handling ops issues, it makes dev work much harder. We're not a large team, so it evens out to 2 people per region (one of whom is always "on-call" and can't do dev work). The communication costs from time zones are real, and it makes everyone's context on what is going on different because they see updates from different regions.

> I remember it being something like $1600 per week, but I forget the exact number

No wonder, that's a pretty generous on-call stipend. I've worked places that paid for on-call, but never that well. It was usually more of a token amount, like $200 or $400 for the week. I.e. far less than it would be if you were paid your hourly wage (averaged from salary).

Overall, I think "follow the sun" is a great idea for teams that are generally not forward looking. It's hard to communicate on forward looking projects, but it's easy to hand over operational issues. I would absolutely do it for a NOC-type team, but I would have to think about doing it for a dev team that needs to handle after-hours issues.

Usually follow the sun support doesn't mean devs around the world working on the same things. That would be harder to coordinate for sure. A big enough company is probably better off hiring SREs instead of having dev teams where every meeting time is bad for someone. And a small enough company is probably better off outsourcing 24 hour coverage.

Why can't the on call people do dev work though? Having someone on call and on deadline isn't realistic. But there should be time between on call issues. And in most code bases there are things someone can work on without coordinating with the rest of the team every day.

Google does hire SREs actually. It doesn't mean dev teams aren't on call also. But dev teams do get paged much less frequently.
> The balance we found at the company I work at is that you are oncall Thursday-Thursday and get Friday off

If I'm reading this correctly, the expectation is that the on-call person doesn't sleep for a week, and that a single day off work is fair compensation.

I think you're being pretty disingenuous here. It is very unusual for anything to come up overnight and being paged is the last resort. If there was a sleepless night, we'd make sure you weren't oncall anymore that week. Oncall is about being available to be called within a certain amount of time, not being awake to keep an eye on things. Software keeps an eye on things.

Nobody thinks a 24 hour oncall rotation is optimal, which is why companies distribute themselves throughout the world and simply have people at work somewhere 24 hours a day. But even at companies like Google, it's not always possible. You have to balance working on a small team and moving quickly versus having triple dev-team redundancy.

Some other workarounds are:

1) Hire someone to be awake during off hours. They won't be around with the rest of the team, so probably won't have the same understanding of the service that they are responsible for supporting. Personally, I've never seen this work well -- both teams see each other as "out of sight, out of mind" and don't really help each other.

2) Ignore all issues between 5PM and 9AM. This is quite possible to do, and might be the right thing for certain companies.

3) Hope nothing bad happens, and when something bad does happen, call everyone on the team frantically hoping someone will be awake and answer your call.

Like I said, I'm happy with the balance I have at work. I think it gives our customers the confidence they need to trust us, while giving engineers a decent work-life balance. I'm just some random engineer; I didn't start this company or force this upon others. I chose it for myself.

I shared my experience because I think it's relatively unique (with the OP's experience of mandatory uncompensated work being the norm), and I like it.

> It is very unusual for anything to come up overnight and being paged is the last resort.

This does not jibe with my experience. Most companies aren't Google as you've described it, and in most cases the person on pager duty is the first human examining the incident.

> If there was a sleepless night

How about if they get woken up each night for an alarm that turns out to not be a big deal? That is the typical on-call experience, getting woken up for 15-30 minutes each night, cortisol from 0 to 100 in the 15 seconds it takes to get into Work Mode.

I guess that doesn't qualify as "sleepless" but between it and the general stress of not being able to turn the phone on silent, I'd call it "shit sleep." Nobody should be subject to it. How can you expect somebody to produce decent software in this condition?

> That is the typical on-call experience, getting woken up for 15-30 minutes each night

There is no "typical on-call experience". Some teams have an on-call rotation that goes a year without being used. Some oncalls get paged once a week and it's a serious issue that will take an hour or two to resolve. Some oncalls are impossible to handle, with alerts every few hours.

How does your experience at some other company help understand this guy's company? Makes no sense as a response.

This is like when I said I once had unlimited vacation and I took eight weeks off a year for three years and people were like "That's not my experience". Okay, well, sucks for you. No one can do anything with that.

> That is the typical on-call experience, getting woken up for 15-30 minutes each night, cortisol from 0 to 100 in the 15 seconds it takes to get into Work Mode.

The only company I ever had that happen with was the big company. At the other two companies I've worked with that had on-call, if anything like that happened, we would be tweaking alarm levels so it didn't happen anymore.

If you're not tweaking alarm levels or fixing code to clear out false alarms, it's not a sustainable on-call rotation and that needs to be fixed immediately.

I've been the solitary on-call for the main service of a company before and I almost never got called because 1) we had good KB articles for the operations center for when things did break; and 2) things very rarely broke in a way that wasn't automatically fixable.

It's amazing in how many cases "remove broken machine from pool automatically and then restart service and bring that machine back on service crash" is a valid fix for the weird, extra edge case junk that would otherwise be a call.
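
To make that pattern concrete, here is a minimal sketch of the remove, restart, re-add loop. Everything in it is hypothetical: `Pool` is an in-memory stand-in for a load balancer pool, and `restart_service` is a placeholder for whatever your init system or orchestrator actually provides. The point is only the control flow: a human gets paged when the automatic fix fails, not before.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Machine:
    name: str
    healthy: bool = True


@dataclass
class Pool:
    """In-memory stand-in for a load balancer pool (hypothetical)."""
    members: Dict[str, Machine] = field(default_factory=dict)

    def add(self, machine: Machine) -> None:
        self.members[machine.name] = machine

    def remove(self, machine: Machine) -> None:
        self.members.pop(machine.name, None)


def restart_service(machine: Machine) -> bool:
    """Placeholder restart that always succeeds; swap in a real restart call."""
    machine.healthy = True
    return True


def remediate(
    pool: Pool,
    machine: Machine,
    restart: Callable[[Machine], bool] = restart_service,
    max_attempts: int = 3,
) -> str:
    """Pull an unhealthy machine, try restarting it, and re-add it on success."""
    if machine.healthy:
        return "ok"
    pool.remove(machine)  # stop sending traffic to the broken machine
    for _ in range(max_attempts):
        if restart(machine):
            pool.add(machine)  # back in service, nobody was woken up
            return "auto-fixed"
    return "page-human"  # automation gave up; now it is worth a call
```

For example, `remediate(pool, Machine("web-1", healthy=False))` returns `"auto-fixed"` with the placeholder restart, and `"page-human"` if the restart callable keeps returning `False`.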

I've experienced this at small startups and BigCorps. Granted, at the BigCorps fewer things blew up in general, and when they did, it was interesting.
> How about if they get woken up each night for an alarm that turns out to not be a big deal?

Write a post-mortem on this "crying wolf" fact. It is definitely a bug in your alerting rules, so actions have to be taken, otherwise others will routinely ignore important alerts.

On my teams, we've come up with this: if you get woken up in the middle of the night for an actual incident (one lasting longer than 15 minutes, after 10:00 PM), you can take the following morning off (half day) to catch up on your rest. If the incident takes longer than an hour to resolve (between 10:00 PM and 7:00 AM), you can take the following day off once you have your paperwork in order.
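
The rule above is simple enough to write down as a function, which can help when proposing something similar to management. This is only a sketch under stated assumptions: it keys on when the incident starts, and it applies the same 10 PM to 7 AM window to both the half-day and full-day cases, which the comment does not spell out.

```python
from datetime import datetime, timedelta

NIGHT_START_HOUR = 22  # 10:00 PM
NIGHT_END_HOUR = 7     # 7:00 AM (assumed to apply to both tiers)


def in_night_window(ts: datetime) -> bool:
    return ts.hour >= NIGHT_START_HOUR or ts.hour < NIGHT_END_HOUR


def time_off_earned(start: datetime, end: datetime) -> str:
    """Return "none", "half day", or "full day" for a single incident."""
    if not in_night_window(start):
        return "none"
    duration = end - start
    if duration > timedelta(hours=1):
        return "full day"
    if duration > timedelta(minutes=15):
        return "half day"
    return "none"


if __name__ == "__main__":
    woken_at = datetime(2020, 3, 3, 2, 0)
    print(time_off_earned(woken_at, woken_at + timedelta(minutes=20)))
    print(time_off_earned(woken_at, woken_at + timedelta(minutes=90)))
```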
I've had on-call duties in most of my positions for over 25 years. In my experience, if the experience for the on-call person is terrible then you probably need code/infrastructure improvements to make it more stable and/or more hands to do the work.

If the company is not allocating the proper resources to the issue and it's affecting you personally, then you need to leave. You have a business relationship with work; don't let it become personal.

Exactly. If on-call is hard, it means everybody is doing mediocre work, and therefore you are probably not growing/learning. If you don't want to leave, start thinking of what parts are the worst offenders and tackle those as high priority (harder than it seems).
Hard agree here
100% in agreement. It's how I've been running my teams at all the companies I've been at, too.
Leave.

Speaking only of your situation, your company isn’t going to appropriately comp you for the on call burden, and they’re going to string you along (“we’re working on it”) as long as they can. If you stay, you will continue to suffer, and unless your comp is exceptional, it doesn’t appear to be worth it.

They might change after enough folks burn out and/or leave, but that’s not within your control. Your quality of life is within your control.

Yea, definitely agree. I've had murderous oncall shifts at some companies where the expectation was to just deal with it. Life is short. Bad oncalls affect your health and happiness. You can only do so much to change your org from the bottom rung. Look for a solid company that invests in improving the app enough and one where you can be happy.
If you don’t already know at least four things that are being worked on, nobody is “working on it”.

We do rotations, two shifts. I spend at least two days a month working on alert prevention or faster recovery. So does my whole team. If anything big happens, I’ll spend four to six days (I tend to volunteer for resiliency work. My standards are higher, and I can actually talk about human factors instead of staring blankly or blamecasting).

So while other things are being “worked on” I can almost always name three concrete ones we’ve done recently.

You might be onto something there (and indeed that might be the inevitable result in this case), but it always feels a little icky to just use the nuclear option without trying to effect change. Even if I go, the rest of the obedient-but-very-likable team would be left slightly worse off.
It's not the nuclear option. Put your feelers out, see if you can get a better job that you like the sound of, and then leave.

_That_ is how you effect change when your management and leadership is comfortable with the status quo, and if the colleagues you're fond of are too loyal to the company for their own good, maybe it's the kick in the ass they need.

Beyond that, your mental wellbeing isn't going to withstand the pressure of taking on a burden you don't appreciate so that your colleagues don't have to suffer as much.

You can change your organization or change your organization. The latter is often easier.
The nuclear option is the only option which normally brings change when enough people make use of it. Everything else is just kicking the can down the road, but nobody will ever pick it up.
> on-call isn’t compensated with the rationale that it’s part of your engineering duties

This is twofold; namely your team and management should be aware that you aren't available for normal work capacity when you're on call.

> theoretically, you should not be exercising or driving if you’re on

This cannot possibly be sustainable. Your company needs to have someone else available: a backup in case one person misses an alert, people for at least the other two shifts, and someone who can cover while you're driving, eating, exercising, or using the facilities.

Your company is just lying to itself if it believes it has any coverage.

> In a gambit to prioritize uptime over engineer time, we have more alarms, tighter tolerances, and a larger operation that generates more tail problems. Good for users, but not so good for us.

This sounds like the crux of the problem. Your company has prioritized rapid fixes over sustainable engineering. The bandaid may be repeatable, but that doesn't make it sustainable with growth. The simplest solution is that for any amount of time spent on call, 2x as much time should be spent resolving the tech debt that leads to such a situation.

> IMO compensation or extra time off would be ideal

I think that you should negotiate this based solely on the fact that you can no longer sleep. I.e., you should take days off for every night you work.

We have a pretty tight oncall (5 min response time).

I think the steps you can take are:

1. Make it clear to your manager this is unacceptable, and you will end up looking for alternate teams/jobs if this goes on

2. Make the same thing clear to your skip level

3. Quit / change teams, citing oncall as the issue

There's no point in doing anything else, in my experience. It's someone else's job to make sure that your oncall experience is prioritized. It sucks to leave an otherwise good job.

For extra credit, try to propose some solutions. Why are some issues not solvable by engineering? Would simply resetting expectations mitigate the largest issues/waking up at night?

Thanks!

> Make it clear to your manager this is unacceptable, and you will end up looking for alternate teams/jobs if this goes on

I'm trying to do this in as harmonious a way as I possibly can, but I'm a bit worried that getting really contentious about it might have negative repercussions. It's possible that I'd "win" and allowances would be made, but it's also possible I'd end up making some real enemies and/or be put on a track out the door.

One hopefully-unusual circumstance here is that most of the rest of my team (and in fact the company) either don't mind the situation much, or at least aren't openly vocal about it, which makes me look like that one nail hanging out that's ready to be slammed back down.

> Quit / change teams, citing oncall as the issue

This is probably the inevitable solution unfortunately, although I will feel bad exiting (making the rotation even smaller) and without having moved anything in the right direction.

> Why are some issues not solvable by engineering? Would simply resetting expectations mitigate the largest issues/waking up at night?

Yeah, agreed. This is the obvious way out if at all possible, but there are many types of alarms where it's fairly difficult. For example: (1) cases where there is a big problem and we get paged essentially as a side effect of one failure causing issues in our part of the system, or (2) catch-all alarms designed to page when something looks suspicious enough to merit human attention, even if not a known failure case. There's a strong attitude of err-on-potential-issues, so relaxing any of these tends to be a no-go politically.

> I will feel bad exiting (making the rotation even smaller) and without having moved anything in the right direction.

FWIW, I've quit jobs on short notice because of poor conditions like this, and my leaving increased pressure on those who were still there. These are good things to keep in mind:

- They are free to resign too.

- Their predicament is entirely the fault of the employer, not you.

- Employees are often willing to soldier on out of a sense of duty to their coworkers, which gives the employer no incentive to change. To the company, it's a case of "if it ain't broke, don't fix it".

A former employer of mine once had all developers working 60 hour weeks because "this is what it takes to be competitive in the industry". The staff grumbled and complained, but it wasn't until there was a mass exodus of senior developers that they suddenly discovered the value of happy employees. That company is actually quite a nice place to work now. Some executives are incapable of seeing the error of their ways without real consequences.

There will likely always be someone willing to fill any position regardless of how abusive it is. There’s no reason to suffer because other people have decided they’re going to suffer at a miserable job. There’s also generally a huge pool of talent that management could find a way to accept for a position, if they’ve gotten the correct incentives.
They can definitely find replacements, but they're still hurt by the knowledge that walks out the door with their former employees. My previous employer had a large decade-old (at the time) codebase that was only well understood by people who had been there since the beginning. Losing all those SMEs was a painful blow, which was why they finally changed their policies.
When they have alerts tightened down that far and want them responded to that way, I suspect they’re institutionally or personally paranoid. I’ve seen this happen when your department is the object of derision from other departments, i.e. when IT is treated as a cost center. This is a huge red light and a reason to run for the hills. Perhaps the nice option is that it’s just your immediate management who’s overreacting, and switching teams can make it better. But I file this combination as a grave condition to avoid at all costs.
Get a course on monitoring, so that you can use that to speak with authority and tell others that they are in fact incompetent. Spoiler: case (2) is a big no-no for paging alarms. And "err-on-potential-issues" is a synonym for "crying wolf", which is an antipattern. All paging alerts must be actionable and have a playbook, period, and no exceptions.
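
The "actionable and has a playbook" rule can be checked mechanically. A minimal sketch, assuming alert rules are available as plain dictionaries: the field names (`severity`, `playbook`, `action`) and the sample alerts are hypothetical and do not correspond to any particular monitoring tool's schema.

```python
# Hypothetical alert definitions; field names are assumptions of this sketch.
ALERTS = [
    {"name": "api-5xx-rate", "severity": "page",
     "playbook": "wiki/playbooks/api-5xx", "action": "roll back last deploy"},
    {"name": "something-looks-odd", "severity": "page"},
    {"name": "disk-70-percent", "severity": "ticket"},
]


def lint_alerts(alerts):
    """Return a list of problems; an empty list means every paging alert passes."""
    problems = []
    for alert in alerts:
        if alert.get("severity") != "page":
            continue  # only alerts that wake a human are held to this bar
        name = alert.get("name", "<unnamed>")
        if not alert.get("playbook"):
            problems.append(f"{name}: paging alert has no playbook")
        if not alert.get("action"):
            problems.append(f"{name}: paging alert has no stated action")
    return problems


if __name__ == "__main__":
    for problem in lint_alerts(ALERTS):
        print(problem)
```

Run against the sample data, the catch-all `something-looks-odd` alert is flagged twice, while the non-paging ticket alert is ignored.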
> I'm trying to do this in as harmonious way as I possibly can, but I'm a bit worried that getting really contentious about it might have negative repercussions.

You mentioned "contentious". Are you concerned that the conversation can't remain friendly, professional, and cordial for some reason?

On feeling bad because of exiting: You're giving management one more data point that this situation doesn't make sense.

Engineering managers don't care about complaints you make while smiling and trying to be nice; they care about employee churn rate.

Hopefully things will change for the good in the future.

> We have a pretty tight oncall (5 min response time).

A previous employer wanted to drop the response time from 15 to 5 minutes. That was the straw that broke the camel’s back and everyone refused to do on-call until we got new contracts which paid for on-call quite generously. Management pissed and whined of course but a year on, the on-call payments were dwarfed by the savings made.

> pretty tight oncall (5 min response time)

How long is each oncall shift?

Each shift is a week long.

We're fortunate that our team has a European counterpart, so we don't have to respond at night. We do 9:30 am -> 9:30 pm, and there's 5 members in our rotation.

If your expected response time is of the order of 5 minutes, then you are not "on-call", you are working 12 hour days and your compensation and time off arrangements should reflect that.

I suspect that if the company is currently getting that amount of extra work (over and above a normal length working day) for free, then you're unlikely to be able to get them to change that. If it was me, I'd be looking for a role in another team or company that has a more realistic approach to on-call.

Any potential extra impact on your current colleagues that you leaving might cause is the responsibility of your management and up to them to mitigate. How your current colleagues decide to react to the on-call situation should be up to them.

Good luck resolving this, I've been in work situations that had unreasonable expectations myself and I appreciate how stressful it can be.

>If your expected response time is of the order of 5 minutes, then you are not "on-call", you are working 12 hour days

I'm a new dev at a fairly young startup. We have recently started an oncall process and we have similar response times for oncall though our workload isn't nearly as heavy since our scale is low. What's the standard in oncall response times/expectations?

I don't think there's any real standard, since it very much depends on application SLAs, industry sector, size of the on-call team, geographic location, length of on-call rotation, frequency of call-out, how realistic the management are, how much inconvenience the team members are willing to put up with, etc.

For example, I work in London and it would be unreasonable to expect that someone could travel between home and work on public transport and still meet a response SLA less than one hour. That would likely be a different length of time in another location, or if people worked 100% remotely, for example.

My opinion is that if you have a response time less than say 30 minutes, then you actually need to be compensating people for sitting in front of their computers ready to respond immediately, whether that be in the office or remotely.

Unless call-outs are very frequent (in which case there are underlying reliability, capacity management, and/or alerting issues which need to be resolved), then on-call isn't really about the extra time spent working, but the restrictions on what one can do whilst on-call.

To use a fairly simple metric: if an on-call SLA means that I have to be concerned about whether I can pop out to a local shop or how long I can spend in the shower, then I don't think that I would be on-call, I would be working.

Of course start up environments (especially early stage) are always different from more corporate environments, and there are generally greater resource constraints. For a start up I am usually looking more at what valuable experience I can gain, rather than maximising remuneration (subject to a certain base-level of course).

However ultimately the question remains the same: do I think that what I am getting out of this role is worth what I have to put into it? There are probably roles in which I'd be willing to put up with the inconvenience of very short on-call SLAs, because either they paid very well, or I was gaining very valuable experience.

Whether a role fulfils one's own expectations for the reward/expenditure ratio is a question that everyone has to decide for themselves.

Wait, so you have to be available within 5 minutes between 9:30AM -> 9:30PM for a whole week?

I hope you are getting paid a lot. What happens if you get paged while taking a shit? Do you get reprimanded if you can’t get off the toilet in under 5 minutes?

You need to pressure them. They have no incentives to change this system, so give them incentive.

There are so many red flags in your post, if I were you, I'd start looking for my way out immediately (I don't know your personal situation so ymmv). Company culture changes slowly, and based on the number of red flags, I'd say it's probably easier to leave and find something better. If you are staying because you like working with some people, ask them to consider joining the same company where you get hired. If you try to change the organization, be prepared that people will stop liking you because you are not the obedient little code monkey anymore and they won't like that you don't let them exploit you anymore.

> on-call isn’t compensated with the rationale that it’s part of your engineering duties

That's just wrong; do not let them convince you this is normal. If they want you to be available at all times, then they should pay. If they don't want to pay, tell them you won't do on-call anymore.

> In a gambit to prioritize uptime over engineer time

I see they are very generous with your time.

> Being able to sleep fully through the night is increasingly rare.

Again, it's a sign that your system is unstable. You need to ask them to prioritize fixes to these issues, even if they can't be solved easily. Take a good look at how development is organized. Do you have automated tests, code reviews, knowledge sharing? Are you always working on features and ignoring bugs? Running systems should not be this hard.

> yeah, we’re looking into it

This is an acceptable answer exactly once.

> The company could easily afford it, but probably doesn’t want to pay for what it can get for free.

I didn't want to be philosophical but: Power concedes nothing without a demand.

My suggestion would be to take care of your health. There are companies that can afford to pay you for each and every extra hour of on-call, but they won't necessarily look out for your well-being. Having extra time off sounds nice, but on the other hand it will affect the other stuff that you usually have to do at your job.

I was hit by on-call duty pretty hard at some point in my career. I was sleep deprived and was not able to execute on my regular tasks. This also led to depression and increased my anxiety. Even though I started working on my issues with a therapist, I was not able to recover and was let go.

Remember about taking care of yourself.

Thanks! I appreciate the thoughts, although it doesn't sound like I am as bad off as you were in this case.

I can hugely sympathize with on-call bleeding into other parts of your life/career though. One of the things I detest most about the current setup is that on-call is considered to be a 100% "extra" obligation. Even if you were up all night responding to pages, you are still expected to be back in your seat on time the next morning and working at full throttle. Unless it's a really outlying case, then no allowances are made on adjusting other expectations for the work you're doing outside of work hours.

It sounds like this is what happened to you. Sorry to hear it, and hope you found something better.

It being additional to normal work is not typical.

I would find a new job.

I gave up this idea for a couple of reasons and instead am negotiating for the whole package, which I just assume includes some amount of after hours / on-call work.

After a lot of thought, asking for on-call compensation is a loss for me. First, there are many people willing to not ask for it. Second, I don't want to be associated with the ones that do. Third, even in the best case the on-call pay doesn't seem to compensate me enough for the time spent. Fourth, it makes it hard to negotiate my base salary, which is where the money is. Fifth, it creates an unhealthy motivation to spend even more time at work rather than be more efficient with it (for example, work to create an environment where I don't have to be on-call or have to spend less time after hours in general).

So, instead, I am showing I take ownership of the area I am working in, I am willing to sometimes decide that the project requires me to spend some extra time, that I am happy to do what is needed to get the job done, and I try to sell it to my company as a complete package.

Thanks. Not exactly the answer I wanted to hear, but still a useful one nonetheless. I suspect that a great many people in SV have a similar policy, even if they've never consciously thought about it before.

I've always had a similar one for the last ten years, and have only been reevaluating it recently as there's been this intersection of (too many days on the clock) ∩ (too many pages) ∩ (pages from factors beyond my control) ∩ (reduced leeway in time-to-respond). It does eventually become a problem.

The fact that it's not a priority up the management chain is a red flag. Try speaking with the skip-level manager or higher to see if you get some attention. At the very least, I would ask for days off to compensate for a bad rotation. Is fixing root cause prioritized? If you are just adding more alerts and not prioritizing fixing underlying issues, that's another red flag. Try to see if you and your team members can get your management to prioritize this. Having a group of people asking for this can be more effective than just one person. If you have enough leverage, push back on new feature work until fixing oncall problems becomes a priority. If all of this fails, leave - there are any number of places where oncall is not as much of a burden and/or you are compensated for your time. Good luck!
I worked as a developer and consultant at a big name SV company and was not compensated for "on call" time and in the end, not knowing any better just "sucked it up".

Now I would definitely ask some questions if I felt that this responsibility was falling disproportionately on my shoulders.

Do all employees at my current level have the same on call responsibilities and schedules? If not, how are custom schedules arrived at?

Do more senior employees work on call rotations and if not, at what job level are they excused?

Are a couple that seem very reasonable to me.

100% agree that these are questions that should be asked, and if you're in position to, hard-balled as a condition of joining something.

It's more complicated in practice than it sounds though because before you join, you'll get a manager well-versed in giving you either non-answers or "soft" lies. For example, they'll say, "Yep, we have an on-call schedule. People go on every few weeks (probably meaning: 3-4 days every few weeks), and it gets a few pages, but isn't too bad." They'll generally refuse to get into specifics, and most people who are excited about the new opportunity, won't drill into it.

After you join, it's kind of too late. Best case you've got a team who's also willing to go to bat with you to get some fixes in, but more standard case, you'll find yourself on the wrong end of a pager without a huge amount of negotiating power unless you include the nuclear option of just getting out.

I've had one job where they pulled that. One week a month I was "on call", which meant waking up at 3am most nights and working through to office hours babysitting/bullying the system. In theory we got time off to compensate, in practice if we arrived back in the office after about 9am we'd get funny looks and the boss would pop past to ask what the problem was. "I just worked 5 hours" did not matter, we got that one hour of extra time off and that was it.

There was no engineering effort to fix these problems, it was just accepted that "sometimes the overnight jobs don't work".

When I quit I made it clear that that was the reason. And even though I quit during the 90 day trial period I still gave them a week's notice. The boss was not happy, but was adamant that the system he had put in place was working well and did not need to be changed. He wanted me to be on call during my week's notice period!

There's two fixes: technical, by stopping the callouts from being necessary; and stopping management from imposing the callouts. Or you can get paid enough to make you happy with the callouts. If you can't do one of those you really need to get out.

If I were you I'd go have a talk with your union rep, especially as this is impacting multiple teams. They can then raise this higher up the management chain.
This right here is a great example of why unions are essential. They're the only bargaining chip employees really have.
Well not really but the only other real bargaining chip is the employment contract.
Union rep? What do you think this is, Google?
This should not be taken lightly. Your life is affected by this, since you mentioned you cannot sleep at night, exercise, or drive. As the company advances, your duties will increase and your personal time will be gone.

Best thing would be to raise this with your manager, if no real action is taken then leaving or changing team is an option.

Being oncall and paid for it is much better. Here your personal time being lost with no compensation is simply not worth it. In fact if you don't respond in time it may reflect badly on you.

There is an approach that can be taken to focus sprints on only improving oncall but it requires management buy in. How bad is it? Is it something out of your teams control or is it something if you spend an hour over you can fix for good?

Thanks!

> Here your personal time being lost with no compensation is simply not worth it. In fact if you don't respond in time it may reflect badly on you.

Yep, you're onto something there.

A particularly insidious effect is that although no one notices the pages you do get to, they definitely miss the ones you don't. I have something like a 99% hit rate, but I've been given a very hard time on the few that I've missed.

> There is an approach that can be taken to focus sprints on only improving oncall but it requires management buy in. How bad is it? Is it something out of your teams control or is it something if you spend an hour over you can fix for good?

One issue is that many pages tend to be bucketed in err-on-the-side-of-caution type alarms. Like they might not even be directly indicative of a problem, but often are, but basically just need to have a human take a look.

Another is that although it's bad, it's not that bad. I've seen teams at other places who've had it much worse and indeed we're not even the worst off at the company. We have the typical corporate problem of underwater-all-the-time, so improvements aren't super likely to be prioritized unless things move even further south.

Read the SRE book chapters on paging and alerting. This is an unhealthy workplace. There should be a strong focus on every page requiring specific action.
> One issue is that many pages tend to be bucketed in err-on-the-side-of-caution type alarms. Like they might not even be directly indicative of a problem, but often are, but basically just need to have a human take a look.

If you wanted to get all passive aggressive about it - consider calling up your manager and skip manager every single time one of these intrudes on your non-paid time.

(This is a really bad idea BTW... It _might_ help if you're on friendly terms with them both and they're really just oblivious to the problems, but as you've indicated that you're already worried about blowback from even mentioning it... You'll almost certainly be better off walking. The "extra load" on the rest of the team is totally a management problem, not yours.)

The only people getting paged by "err-on-the-side-of-caution type alarms" should be the people who set them up (or the people who asked for them to be set up).

At a previous company, on-call was optional and we were compensated for that week when we were on call (an extra $500, IIRC.) About 50% of the team of 12 participated, meaning your week came up basically every month and a half. This seemed fair.
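The cadence above can be checked with a quick sketch. The function name is made up for illustration, and it assumes one-week shifts shared evenly among the people who opted in:

```python
def weeks_between_shifts(team_size: int, participation: float, shift_weeks: int = 1) -> int:
    """Return how many weeks pass between the start of one person's shifts."""
    participants = round(team_size * participation)
    if participants < 1:
        raise ValueError("at least one person must be on the rotation")
    # Each participant takes one shift per full cycle through the rotation.
    return participants * shift_weeks


# 50% of a 12-person team is 6 people, so each person is up every 6 weeks,
# which is roughly the "month and a half" mentioned above.
print(weeks_between_shifts(12, 0.5))
```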

If your company doesn't want to pay for the aggravation, put your phone on silent, make sure alarms escalate to your manager, and start looking for another job.

> If your company doesn't want to pay for the aggravation, put your phone on silent, make sure alarms escalate to your manager, and start looking for another job.

There was a story floating round a company I used to work at, where (well before my time there) a manager had dropped the "on call" phone onto the desk of a dev with no notice, who was just told "$otherguy just quit, so you're taking the rest of his on call week". Said dev was leaving for (pre-arranged and booked/paid for) vacation the next day, so he just surreptitiously slipped the on call phone into the manager's briefcase and left for the week...

Manager ranted for a few hours about getting woken up, until senior management heard about it and very publicly slapped them down for it.

> At a previous company, on-call was optional and we were compensated for that week when we were on call (an extra $500, IIRC.) About 50% of the team of 12 participated, meaning your week came up basically every month and a half. This seemed fair.

That seems roughly fair to me too. Strangely, that's always how it worked at non-SV companies—of course you get compensated if there's considerable off-hours burden. For some reason though, SV seems to have established a new standard here. Maybe because people are often younger and/or already paid "too much".

> If your company doesn't want to pay for the aggravation, put your phone on silent, make sure alarms escalate to your manager, and start looking for another job.

Hah, well not ready to burn the bridge to this level quite yet, but yes, unfortunately that's the only option that's a guaranteed solution.

Companies should have permanent support teams that work shifts, and all they do is support work. That way people who are happy to do shift work (i.e. work nights sometimes) and get paid more for the trouble can do that, and regular engineers can not have the stress of having to be on-call. I think it's absolutely ridiculous that I have to be willing to get woken up at 3am and work on something, and then do a day at work the next day.
Sometimes, it's difficult to get paid extra due to organizational barriers. Then a reasonable option is to get time off. For each hour of sleep you lost, perhaps that day you stop working 2 hours earlier. This need not be discussed, you can simply tell your manager.

Otherwise, a little extra involvement might be necessary. Ask for your manager's private phone number. When there's a problem, share your problem with him. In the middle of the night. They may get some new insights into the difficulties.

I think disruption of sleep is so detrimental to health that each call should be compensated with one day off (if you are not getting paid for it)
> This need not be discussed, you can simply tell your manager.

Thanks. Yes, I suspect that this is informally how it often works for people taking the worst of the brunt. What's a little irking though is that it's not formalized in any way, so only the people willing to go out of their way to be the squeaky wheel would get the benefit. Everyone else just swallows the extra hours.

I know that's how a lot of the world works, but it's not very satisfying.

> Sometimes, it's difficult to get paid extra due to organizational barriers.

That is total manipulative management/HR bullshit.

I would immediately counter with "It's difficult for me to be on call outside office hours due to personal life obligations."

Their poor organisation is not your problem. Their need for additional work outside the hours for which they are paying you is totally their problem.

> Sometimes, it's difficult to get paid extra due to organizational barriers.

Check local laws, re-examine your contract, complain to the state. Here in Russia 2x payment for work in unusual hours is mandated by the law.

Where I work, we're paid 1/3 of our regular rate for every hour on call. People are much less likely to want you to be on call for no reason when that's the case.
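To put a rough number on that scheme, here is a small sketch. The hourly rate and the assumption that only hours outside a 40-hour work week count as on-call are made up for illustration; only the 1/3 multiplier comes from the comment above:

```python
from fractions import Fraction

HOURS_PER_WEEK = 7 * 24  # 168


def on_call_pay(hourly_rate, on_call_hours, multiplier=Fraction(1, 3)):
    """Pay for being on call, as a fraction of the regular hourly rate."""
    return hourly_rate * on_call_hours * multiplier


# Assumption: a week-long shift where the 40 regular working hours are
# already covered by salary, leaving 128 on-call hours.
off_hours = HOURS_PER_WEEK - 40
pay = on_call_pay(60, off_hours)  # hypothetical rate of 60 per hour
print(f"{float(pay):.2f}")
```

At that hypothetical rate a week on call costs the employer the equivalent of about 43 regular hours, which is the kind of number that makes people think twice before adding a rotation.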
Extra compensation or time off is usually a bad idea for on-call responsibilities, because it puts the wrong incentives in place. Teams should be working to improve their infrastructure so that on-call is less painful, not lobbying for additional pay because someone built a hard-to-maintain system.

Sometimes this is an "up the chain" type problem, but if the other engineers on the team don't agree with you that the on-call rotation is too painful, it's going to be hard to convince management that your judgment is correct.

If you don't want to simply switch teams, my suggestion is to think of what engineering work you can do in order to improve the on-call experience. Then propose that you work on these projects, to your manager. Quantify the amount of engineering time and increased reliability your projects will save. In my experience it is far easier to get management to agree to a specific plan to improve the situation than to get management to find someone else to solve a problem for you.

Another idea - since you work at a large company, there are probably teams who handle this very well at your company. Infrastructure teams who have scaled components that in the past have been overloaded and now are widely used within the company, that sort of thing. Try asking for advice in a "horizontal" way, finding experts on other teams and asking how they have solved these issues in their teams. These "horizontal" experts will be able to give advice that's specific to your company. This is especially true if your team is working on a product area and your coworkers are not specialists in making reliable systems, but your company has infrastructure specialists on other teams.

> Extra compensation or time off is usually a bad idea for on-call responsibilities, because it puts the wrong incentives in place.

I sort of agree, because yes "just throw a little money at it" is the wrong response. But more money is definitely part of the answer, because unless you negotiated that amount of overtime when you signed up you're not being paid appropriately for your work.

> think of what engineering work you can do in order to improve the on-call experience

This is key. During your incident response review for each incident it's important to also keep a summary of overall incidents so you can use statistics to properly prioritise your engineering effort.

It sounds as though none of that sentence applies to the OP, and none of it ever can. Which means the advice to get out is about all that's left.

In my current job I get a token on call allowance (~2 hours pay a week), and it's expected that I will respond to problems, fix/restart/hack the immediate situation into something that works; then come in during normal work hours and analyse the fault, come up with a plan to stop it happening again; and implement the plan. Note that only the immediate fix is "after hours". Some of the fixes are significant - we're re-writing chunks of C++ code in Rust because there are weird memory issues{tm} in the C++ code (because of course there are). Other fixes are trivial, an assert fires and we say "oh, that can actually happen" and code accordingly.

Right now the on call allowance feels like money for nothing, because we have had two alerts in the last three months but they're paying that allowance to 3 people every week. The boss says "you're doing very well, keep it up" because in his view no problems is a good thing :)

Fundamentally I’d suggest you share this pain with your team, including any product managers and management/business development teams.

Change your thinking and approach.

You can do this by culturally re-prioritising the development teams workload to fundamentally treat the root causes for any outage and regular alerts as urgent to be resolved.

The work needed to fix the root cause gets to kick something out of the current sprint to be attended to immediately.

The dev/product team should fundamentally agree the alerts should be rare, not regular.

Instead of just tweaking alarms, and feeling beaten down at the regular issues, change your thinking to tackle the root causes and fix them, just like any bug or new feature.

You’ll become excited that you’re solving the issues.

By having this shared understanding in the dev team to always be resolving root cause of outages, including architecture restructures and rebuilds of components that take weeks or months, you’ll reduce these incidents dramatically.

Finally, by doing this, you share the pain with everyone else - product managers and business leads don’t get their features or other improvements as fast, they now see what you deal with, they’ll ask why things appear to have slowed down, and you can now say you need more resources.

I've seen this on many teams. There are several other options you are not looking at. One is changing the systems so that on-call becomes way less burdensome and much more automated. Not sure if this is an option or not. This isn't easy to implement (e.g. I've often seen engineers misunderstand the problem and focus on cool tech; this isn't a blank check) but it's a great option and one I've seen get people promoted in the long term.

I've also seen teams where this festered and no one fixed it. I usually got called in in the end to fix it. Often the engineers weren't even talking to the managers about the issue and that's all a fix took, a solution that wasn't just more money or more people. It also helps if you can come up with a basic cost benefit analysis in terms of wasted dev time that could be used for something else. This is a language managers speak.

You should really consider and discuss with your manager several of the options in the comments: pay, sleep replacement time, more people on the loop, better automation, tech debt work that is focused on burning down the most common pages, etc. It's never a great idea to show up with only one possible fix, especially when that's "pay me more". They may not be able to, or may not think you are worth it, and then your options are leave or deal. If you have quite a few more options maybe a compromise can be reached.

Engineers just suffering in silence and then quitting in anger is really the worst option tho. So open a dialog if you have not about other options.

As some others said, if you're not getting traction, also talk with your skip... you are meeting with your skip right? But don't come to them with problems and gripes. Come to them with possible solutions and get their advice on those solutions, and be open to their suggestions as well.

I'm probably piling on, but wanted to echo a lot of what's been said.

On-call responsibilities are supposed to be a two way street between an employee and an employer.

Employers expect employees to be on-call and handle production incidents quickly. That's good for the product.

The two way side of it is that employees must have the autonomy and time to fix the root causes of what's paging them to reduce toil.

This is the root of "you build it, you own it". "Own" means having autonomy.

That kind of engineering work does come at the expense of feature delivery. However, it's also good for the product.

Regarding getting paid more for going on-call, from your description, the issue doesn't sound like it's a financial one. If you received $X00 per week more, would that be an acceptable tradeoff for the constant anxiety of your phone paging you at any time or waking up at least once per night?

(source: am ex-PagerDuty and founded a company to help drive software ownership, so I've thought a lot about this)

I'm always surprised people are willing to take money over time. I hate on call and would easily take a pay cut to never have to do it. I don't have kids or crazy expenses so I guess that's easy to say but an extra 2-3k a year is not worth spending a significant fraction of your life tethered to your phone/computer. to each his own I suppose
The truth of the matter is that if you’re working on a web service, you should be responsible for it; at the very least, during its initial days of launch, you should be ‘on call’. Same with any time there are changes you’ve made that are going live.

The “throw it over the wall and let the SREs handle it” pattern is overflowing with anti-patterns.

Now, if you’re making non-internet products, feel free to do it any other way; but, if you’re managing a website or web service, you should be available for it.

Yeah I'm not saying that people shouldn't be responsible for the things they build just that there are a lot of people who will give up their time for usually some small amount of money (given what they make hourly) rather than to just take that time for what they want to. Where I work we have little leeway to fix issues and clients do dumb shit all the time and we have to deal with it. why would you give up your personal time for some marginal amount of money?
I've always expected on-call time to be part of my job. It's part of how I can be okay with working 4 hours one day and 10 hours the next day.

I'm salary and my compensation includes my on-call hours. That's how I've always seen it. Now, if there's a balance and if the work becomes excessive, then a conversation needs to be had and probably standards need to be looked at because something is very wrong at the company. On-call shouldn't regularly be demanding, almost everybody that's capable of it should be doing it in a given role, and so on.

When that's not true, it needs to be investigated because it's unsustainable.

First of all, on-call without the latitude to prioritize operational improvements (making the cause of the pages go away) over feature work is a non-starter.

To address the question in the post title, a team I was on was able to re-negotiate the on-call terms. Our team didn't have any operations to speak of (we just wrote software, and didn't build services) so we were lumped into a rotation for the org we were in. When the pager went off, not only did we not have any familiarity with the system, we didn't have permissions to do anything anyway. We just ended up having to page someone else for every little thing.

We ganged up on management, told them that we simply were not empowered to take any actions during shift to address issues or off shift to improve things, and got taken off that rotation.

Where I'm at now, if someone has a rough night or a couple of rough days, we'll trade part of the shift to give the person a break.

> There are some false positives, but most are not, and not easily fixed by more engineering.

This seems like the crux of the issue. It sounds like there is a long tail of issues that are hard to fix but have large customer impact. Or do they?

If these long tail issues didn't get fixed, how much revenue would it cost? Figuring that out seems key. If it's a lot of revenue, then it would make sense to spend the time to do the hard engineering fixes. If it's not a lot, then it makes sense to let you sleep.

> Management has extricated themselves completely.

This is a big issue too. If the problems warrant waking you up, they should be serious enough to involve management. If they aren't, then it sounds like they're waking you up for no reason.

Sounds like your company should hire people from different time zones and have the on-call follow the daylight so that teams can fulfill that responsibility during normal work hours, especially when expected response time is in single digit territory and on-call isn’t paid extra.

But overall it sounds like the company for which you work is a complete joke who doesn’t care about employee health and you should leave them asap.

Good luck! Don’t forget, engineers are in high demand across the globe!

On call being too eventful is a bug (arch, infra, code). The solution is to propose that every wake is responded to as something that must be prevented going forward.

The usual incident review and postmortem process can be applied. If they happen too often for that, you can start by applying the process to some subset.

Firefighting is a waste of talented technical resources and results in good people leaving.

I'm not in development or engineering directly but work in an operations role (mostly Puppet and Terraform work) where I've been oncall for the majority of my career. One thing that is common when things get really bad is the mindset of oncall doesn't count towards my role at the company. Many people see it as the painful work of cleaning up while the others are out building the next big thing. So it's easy to see people jump into the shift, deal with the mass alerts, then leave without making any improvements for the next guy or gal.

One way we have been trying to improve this is working with PagerDuty reporting and looking at the total number of interruptions (not just pages but any time PagerDuty reminds you about an alert/expired snooze/escalation) with the team. It's very easy to forget the oncall as you leave, but having more eyes on the shifts starts to bring awareness and lots of "why is that still broken" questions that are better answered at 10am vs 3am on a Sunday. I came from a large Operation Center so I know the pain of bad alerts, mostly CYA stuff where it was put in place just to make sure the last guy can't get blamed. Sort of like adding 100s of random smoke detectors in a building without any fire suppression. The intention is good but the results are poor.
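A minimal sketch of that kind of interruption count, assuming you can export a shift's events as (timestamp, alert name, kind) rows. The data, the field layout and the 9-to-18 weekday window are all invented for illustration and are not PagerDuty's actual report format:

```python
from collections import Counter
from datetime import datetime

# Hypothetical export of one shift. Every row is an interruption,
# whether it was a fresh page, a reminder, or an escalation.
events = [
    (datetime(2024, 1, 8, 3, 12), "disk-full", "page"),
    (datetime(2024, 1, 8, 3, 27), "disk-full", "reminder"),
    (datetime(2024, 1, 8, 14, 5), "queue-backlog", "page"),
    (datetime(2024, 1, 9, 2, 47), "disk-full", "page"),
    (datetime(2024, 1, 10, 23, 30), "cert-expiry", "escalation"),
]


def is_off_hours(ts, start=9, end=18):
    """True on weekends, or on weekdays outside the start..end hour window."""
    return ts.weekday() >= 5 or not (start <= ts.hour < end)


def summarize(events):
    """Count all interruptions per alert, and the off-hours subset."""
    total = Counter(name for _, name, _ in events)
    off_hours = Counter(name for ts, name, _ in events if is_off_hours(ts))
    return total, off_hours


total, off_hours = summarize(events)
for name, count in total.most_common():
    print(f"{name}: {count} interruptions, {off_hours[name]} off-hours")
```

Sorting by count puts the noisiest alert at the top, which gives the team a concrete place to start with the "why is that still broken" questions.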

Outside of the meeting with the team, we also have proper handoff meetings with off call and on call, so they can share what's going on verbally instead of tagging the next person with the alerts. Makes it easier to share what's going on, any weird problems, notes. Also we're not using a 24/7 oncall coverage but 12/5 and 48/2 for the weekends, it's a small change but helps so much. The worst I ran was a 7/24 at a major email company and was paged every three hours, for the entire week. After that I knew the team didn't want to change and I needed to do something about it.

I'd quit, there are plenty of companies that don't do that. Look into contracting maybe, I've never heard of a contractor being on call.

Sleep matters.

I wonder if maybe it's a FAANG thing that cheap startups try to copy?

We were on-call (a week every 3 months) with my last employer but it wasn't too bad and it was spread equally across people. It wasn't compensated but during the on-call you didn't do any product work, just improving monitoring and alerts so that being on-call didn't suck & recovering if something happened at a bad time.

Still, being on-call sucked because we had too many stupid monitors checking on trivial things that weren't important and that people were too afraid to touch.

A few other companies I've been at just have a dedicated infra team on-call, which gets paid more.

This sounds like a fair solution. I would've liked the extra money as a youngster and now I would gladly avoid messing up my limited sleep.

If the company you work for doesn't take your time seriously, it's time to look elsewhere. I did what you're doing for a couple of years at a company that just didn't have the leadership it took to get things fixed. I was miserable every day. Partly because on-call sucked but more importantly because I was working on a project that felt like I was shoveling quicksand. Your time, your energy, and your self-esteem are worth too much to be shoveling quicksand for people who don't understand how to do things the right way.

I'm now at a company that takes these things seriously and I'm learning a ton of good habits and I feel really proud of the work that I'm doing.

personally, I just outright refuse to be on call. That cheap bastard needs to hire a SRE. I explicitly ask the question in interview, I let the manager know I have zero interest in doing such a thing. and when they force my hand, I go somewhere else.
Incidents are unplanned investments. They’re a decision by leadership and sap the company’s prospects. Certainly negotiate. Maybe it’s your only job? Maybe it’s a 50% pay boost? Ideally you can show the company a better way: keep records of incidents, make statistical measurements to set expectations and then change from the reactive to proactive stance. The idea is to make things sufficiently reliable _before_ the customer experiences the incident (not after!). However, on the margin, if you’re not being sufficiently compensated, you might need to find wiser leadership.
Left and wished I had done it earlier. Once I gave my notice, that's when they gave a counter offer but I had made up my mind.

Cost the company upcoming tenders and my entire team soon followed. Not something I'm proud of. But priority 1. should be personal wellbeing.

One thing I learned from this is that I fail to convey the severity of issues to a certain type of manager. Framing the issue in dollars helps in those cases.

And if it is understood and still ignored. Welp, time to move on.

The teams I've been on have a rotation that is pretty fair, but unpaid. Most technical members are required to be on the list, with a few reasonable exceptions. If there's someone who isn't in the list, people will joke about them not being on the list and that seems to put pressure on adding them (this won't work in all company cultures). It's usually 1 week every 6-10 weeks depending on team size.
I'd make sure you discuss this with other engineers who are on-call. If everybody agrees, you could declare together that from now on, on-call is your only responsibility and you won't be taking up other engineering duties.

Of course management could fire you for that, but then they don't have anyone on-call anymore, and they're unlikely to find new engineers willing to add it as an unpaid responsibility.

One option for some is to go freelance and make it clear when you take on contracts that you're not available on call (the client's team or another freelancer you work with could step in for example), or at the least you can negotiate fair compensation for it.

Not the solution for everyone but people overlook the option of working for yourself.

What company would simply hire you freelance once you're already full time? It sounds like a bizarre notion to me but admittedly I'm relatively green (6 years in industry) and have only worked a handful of gigs.
I mean leave a salaried full-time position to work freelance for multiple clients (not the company you used to work for). Find your own gigs and clients where you have more freedom over how you choose to work.
I had to take a pay cut and switch to a purely service delivery / provisioning role before I got out of On-Call; same org, different role.

~7 years and a couple of jobs later -- totally worth it.

Just make sure you're using that newly found freedom wisely; don't fight for freedom and then do 3 hours of uninterrupted netflix every night.

Was in a team of around 12 engineers for a XXXM AUD dollar telco project a decade ago.

Customer (Telstra, Australia's national telecom) wanted us to sleep on site.

We were contractors, so we all agreed we would say yes, provided we were paid for 24 hours.

The customer decided they didn't need us to sleep on site.

I used to do on call where each time I got called out was a three hour charge. It's great when you're young and single but not any more. I would never do on-call for 'free'. My time is more valuable and I work to live.
You didn't specifically list where you're based, but worth checking local laws. Some places have requirements for on-call work and companies simply get around it because most employees don't know.
The union was called for help, but this is Sweden where a union is not something to be afraid of and employers actually talk to them with respect.
> Being able to sleep fully through the night is increasingly rare.

Are there actual user-facing issues occurring every night? That sounds extremely bad and unusual.

What is the 'punishment' for not getting to a call on time?
> What is the 'punishment' for not getting to a call on time?

There is no formal one, just more of an implication.

If you miss one page, people will notice, and you might get asked about it the next day. If it's just the one, it'll probably end there. But if it ever turned into a pattern you can be sure that your manager would bring it up in your 1:1s (despite them not being on any on-call rotation of course). It would likely not be a fireable offense, but it'd be made very clear that there'd be an expectation for you to improve.

I suspect that this is more or less how it works at a lot of companies with similar setups.

Is there no backup on-call person, or any kind of escalation path for pages?
At my company, if you don't check into an on-call ticket within 20 minutes your manager will get paged, who will then likely need to page one of your peers to actually fix the problem.

So it's not like you can just ignore it until morning.
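Roughly, that escalation rule behaves like the sketch below. Only the 20 minute delay and the engineer-then-manager order come from the description above; the function name and the idea of a fixed delay per level are assumptions for illustration:

```python
from datetime import timedelta

ESCALATION_DELAY = timedelta(minutes=20)


def who_is_paged(unacknowledged_for, chain=("on-call engineer", "manager")):
    """Return everyone paged so far for a ticket nobody has checked into."""
    # One extra level of the chain is paged for each full delay that passes.
    levels = 1 + unacknowledged_for // ESCALATION_DELAY
    return list(chain[:min(levels, len(chain))])


print(who_is_paged(timedelta(minutes=5)))   # only the on-call engineer so far
print(who_is_paged(timedelta(minutes=25)))  # the manager has been paged too
```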

Why doesn't the company hire someone in another time zone?
> Why doesn't the company hire someone in another time zone?

We have teams in other time zones, but we're sectioned off in such a way that every team manages their own operations, and members within a team tend to be clustered within similarly banded zones (for easier collaboration, etc.).

Yeah, no. They need to 'follow the sun' and have people who can handle things in each time zone around the world instead of waking people up in the middle of the night.

Track how much sleep you're losing and document how unsustainable this is to HR as you leave. Uncompensated, sleep-depriving on-call is not the norm nor should it be for a company of any decent size.

May be worth at least looking into contractors in another time zone to cover your overnight on-call. My team did this a couple years ago and it's significantly eased up our on-call rotation, at least to the point that we're only getting called for major outages while the offshore team handles the minor issues, change/release support and things that go bump in the night. There's still weeks where we put in significant time if there's a major issue, but we're not getting woken up for every little alert that goes off.

I should also add my team has an automatic comp day at the end of our rotation, though that's handled on a team-by-team basis. Plus it's not uncommon to get an extra comp day if you have a particularly brutal week or support some major off-hours work.

What's an SV company? I googled for SV companies and Software V companies and found nothing. Many companies seem to have SV in their names, though.
Presumably short for "Silicon Valley".
Silicon Valley
silicon valley
how is this not even obvious?

i never worked in a company that expected uncompensated on call

even those that had no official policy typically had a lenient look-the-other-way approach, so you can, say, take a half day off if you got called up at night

people gotta realize engineer/support churn is more expensive long term than giving people a fair deal

If you can - leave. There’s plenty of companies with paid on call or more lenient policies.
You can change things if you have power and demonstrate that it'll work. Otherwise, leave.
Sounds like a sweatshop. Just leave.
I have worked for 3 separate companies in my decade of software development where I have had to be on-call. One of these companies was an organization at a very large and prominent software company.

The different on-call rotations worked out thusly:

1. on-call was 1 month long. Response times had to be very short. During business hours, there was a large queue of long-tail work that needed to be resolved that was outside my normal work. Most of the employees here were in their 20s and 30s, probably.

2. Small company. Probably 30 devs total. I was on a team of 1, 2 and eventually 3 people. on-call was 24/7 for my team. Response time was about an hour. I was the youngest employee and most employees here were in their 40s or beyond.

3. Smallish company. < 500 employees. Dev team size of 6ish. On-call is a week-long venture. Turnaround time is very short, I think 30 minutes? On-call is a dedicated period. Most issues can be resolved during business hours; but, emergencies are handled at all times.

For [2] and [3], there were unwritten patterns around how much you really needed to be at work once your shift was over if on-call was particularly bad.

At [1], the on-call was particularly long and harsh for a couple reasons. In the early days, I heard that the on-call was absolutely horrible. Logs were non-existent, errors were terrible and required a great deal of work. But, it caused developers to feel the pain of not logging properly, not handling errors correctly, and not monitoring usefully. Over time, those issues were resolved, and the team now has incredible logging and incredible tooling, built knowing that they're going to be the ones who have to fix it.

At [2], the constant trouble of code prior to my time there caused the developers of the old code to make it more stable. The services eventually became auto-resolving, we had a network operations center (with appropriate work hours that covered the whole day) that had playbooks for all the remaining normal issues; and, the bad stuff made it to us. On-call 24/7 meant I might get called once every couple weeks or less by the end of my tenure there. I lived a normal life.

At [3], we're still learning and the code is in constant churn. Issues come up and we attempt to fix the root cause on most of the issues. Our logging has gradually improved and our monitoring has been improving and they're tweaked to find real issues.

--

My thoughts:

I think on-call is an important experience for developers. Developers should be first responders for their code when it hits production for the first day or two to catch any possible issue.

Developers should know the pain of deploying their change at noon or on a Friday at 5pm, or at 11pm on a Wednesday, so that they accept responsibility and importance if it breaks at those times, and those actions should be above and beyond their on-call rotation.

If the work of the on-call is especially intense, it should be a separate role that the developers take, with a rotation so that that's all that specific developer is working on.

Developers should write code and review code with debugging and tracing and monitoring and self-correction in mind, to reduce on-call pain - and one of the best ways to do that is to make them feel it, themselves.

If your code-base is having as many issues as you suggest, there are probably some common areas and pitfalls that the code has, and maybe there'll be patterns the team can implement each time those same issues come up. As a result, those errors won't come up as frequently.

If the monitors are too noisy with non-errors, then a couple things could be going on. Let's say that the code 500s when someone passes an invalid argument, or a record isn't found. Those probably shouldn't be 500s, so the code needs to be updated for them to not be. On the other hand, if there's a monitor checking for more than 5 401s in a minute, maybe that's a bit strict and should be changed to "more than 10 401s a minute, every minute for 10 minutes; OR more than 200 401s a minute" - that way you catch the big ugly case of "our auth service is down" and aren't paged by people failing to enter their password a bunch (but giving up).
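
To make that relaxed 401 rule concrete, here's a rough sketch in Python. It's not tied to any real monitoring tool: the function name, the constants, and the idea of feeding it a list of per-minute 401 counts (oldest first) are all my assumptions for illustration, and I've assumed the spike check only looks at the most recent minute.

```python
from typing import Sequence

# Thresholds taken from the rule quoted above; the names are hypothetical.
SUSTAINED_PER_MINUTE = 10   # "more than 10 401s a minute..."
SUSTAINED_MINUTES = 10      # "...every minute for 10 minutes"
SPIKE_PER_MINUTE = 200      # "OR more than 200 401s a minute"


def should_page(counts_per_minute: Sequence[int]) -> bool:
    """Decide whether to page, given 401 counts per minute, oldest first."""
    counts = list(counts_per_minute)
    if not counts:
        return False

    # Spike rule: the most recent minute alone exceeds the big threshold
    # (the "our auth service is down" case).
    if counts[-1] > SPIKE_PER_MINUTE:
        return True

    # Sustained rule: every one of the last SUSTAINED_MINUTES minutes is
    # above the lower threshold. With less history than that, don't page.
    recent = counts[-SUSTAINED_MINUTES:]
    return (
        len(recent) == SUSTAINED_MINUTES
        and all(c > SUSTAINED_PER_MINUTE for c in recent)
    )
```

With this, a handful of people fat-fingering their password for a minute or two doesn't wake anyone up, because one quiet minute breaks the sustained condition. Most alerting systems let you express the same thing declaratively, so treat this as a way to reason about the thresholds rather than something to deploy.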

If the code is an absolute and unfixable mess and you don't want to help fix it, if management is not interested in improving common pitfalls, then maybe it's time for you to look for another job.

Here's some additional reading: https://sre.google/sre-book/being-on-call/

Write better code that doesn't need support