by ublaze 1993 days ago
We have a pretty tight oncall (5 min response time).

I think the steps you can take are:

1. Make it clear to your manager this is unacceptable, and you will end up looking for alternate teams/jobs if this goes on

2. Make the same thing clear to your skip level

3. Quit / change teams, citing oncall as the issue

There's no point in doing anything else, in my experience. It's someone else's job to make sure that your oncall experience is prioritized. It sucks to leave an otherwise good job.

For extra credit, try to propose some solutions. Why are some issues not solvable by engineering? Would simply resetting expectations mitigate the largest issues/waking up at night?


Thanks!

> Make it clear to your manager this is unacceptable, and you will end up looking for alternate teams/jobs if this goes on

I'm trying to do this in as harmonious a way as I possibly can, but I'm a bit worried that getting really contentious about it might have negative repercussions. It's possible that I'd "win" and allowances would be made, but it's also possible I'd end up making some real enemies and/or being put on a track out the door.

One hopefully-unusual circumstance here is that most of the rest of my team (and in fact the company) either don't mind the situation much, or at least aren't openly vocal about it, which makes me look like that one nail hanging out that's ready to be slammed back down.

> Quit / change teams, citing oncall as the issue

This is probably the inevitable solution unfortunately, although I will feel bad exiting (making the rotation even smaller) and without having moved anything in the right direction.

> Why are some issues not solvable by engineering? Would simply resetting expectations mitigate the largest issues/waking up at night?

Yeah, agreed. This is the obvious way out if at all possible, but there are many types of alarms where it's fairly difficult. For example: (1) cases where there is a big problem and we get paged essentially as a side effect of one failure causing issues in our part of the system, or (2) catch-all alarms designed to page when something looks suspicious enough to merit human attention, even if not a known failure case. There's a strong attitude of err-on-potential-issues, so relaxing any of these tends to be a no-go politically.

> I will feel bad exiting (making the rotation even smaller) and without having moved anything in the right direction.

FWIW, I've quit jobs on short notice because of poor conditions like this, and my leaving increased pressure on those who were still there. These are good things to keep in mind:

- They are free to resign too.

- Their predicament is entirely the fault of the employer, not you.

- Employees are often willing to soldier on out of a sense of duty to their coworkers, which gives the employer no incentive to change. To the company, it's a case of "if it ain't broke, don't fix it".

A former employer of mine once had all developers working 60 hour weeks because "this is what it takes to be competitive in the industry". The staff grumbled and complained, but it wasn't until there was a mass exodus of senior developers that they suddenly discovered the value of happy employees. That company is actually quite a nice place to work now. Some executives are incapable of seeing the error of their ways without real consequences.

There will likely always be someone willing to fill any position regardless of how abusive it is. There’s no reason to suffer because other people have decided they’re going to suffer at a miserable job. There’s also generally a huge pool of talent that management could find a way to accept for a position, if they’ve gotten the correct incentives.

They can definitely find replacements, but they're still hurt by the knowledge that walks out the door with their former employees. My previous employer had a large decade-old (at the time) codebase that was only well understood by people who had been there since the beginning. Losing all those SMEs was a painful blow, which was why they finally changed their policies.

When they have alerts tightened down that far and want them responded to that fast, I suspect they’re institutionally or personally paranoid. I’ve seen this happen when your department is the object of derision from other departments, i.e. when IT is treated as a cost center. This is a huge red flag and a reason to run for the hills. The best case is that it’s just your immediate management overreacting, and switching teams can make it better. But I file this combination as a grave condition to avoid at all costs.

Take a course on monitoring, so that you can use that to speak with authority and tell others that they are in fact incompetent. Spoiler: case (2) is a big no-no for paging alarms. And "err-on-potential-issues" is a synonym for "crying wolf", which is an antipattern. All paging alerts must be actionable and have a playbook, period, no exceptions.
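
To make the "actionable and has a playbook" rule concrete, here is a rough sketch of a check you could run over your alert definitions. Everything in it is made up for illustration: `AlertRule`, its fields, and the example alert names are assumptions, not any real monitoring tool's format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertRule:
    """Hypothetical alert definition; field names are illustrative only."""
    name: str
    pages_human: bool                # True if the alert pages the oncall
    action: Optional[str] = None     # what the responder is expected to do
    playbook: Optional[str] = None   # where the documented steps live


def paging_violations(rules: List[AlertRule]) -> List[str]:
    """Return names of paging alerts that lack an action or a playbook."""
    return [
        r.name
        for r in rules
        if r.pages_human and not (r.action and r.playbook)
    ]


# Example data (invented): one well-defined page, one catch-all page,
# and one non-paging alert that is allowed to be vague.
rules = [
    AlertRule("api_error_rate_high", True, "roll back last deploy", "playbooks/api-errors"),
    AlertRule("something_looks_suspicious", True),
    AlertRule("disk_usage_trend", False),
]

print(paging_violations(rules))  # ['something_looks_suspicious']
```

A catch-all alarm like case (2) fails this check by construction, which is a less confrontational way to raise it than arguing about each page individually.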
> I'm trying to do this in as harmonious way as I possibly can, but I'm a bit worried that getting really contentious about it might have negative repercussions.

You mentioned "contentious". Are you concerned that the conversation can't remain friendly, professional, and cordial for some reason?

On feeling bad because of exiting: You're giving management one more data point that this situation doesn't make sense.

Engineering managers don't care about complaints you make while smiling and trying to be nice; they care about employee churn rate.

Hopefully things will change for the good in the future.

> We have a pretty tight oncall (5 min response time).

A previous employer wanted to drop the response time from 15 to 5 minutes. That was the straw that broke the camel’s back and everyone refused to do on-call until we got new contracts which paid for on-call quite generously. Management pissed and whined of course but a year on, the on-call payments were dwarfed by the savings made.

> pretty tight oncall (5 min response time)

How long is each oncall shift?

Each shift is a week long.

We're fortunate that our team has a European counterpart, so we don't have to respond at night. We do 9:30 am -> 9:30 pm, and there are 5 members in our rotation.

If your expected response time is of the order of 5 minutes, then you are not "on-call", you are working 12 hour days and your compensation and time off arrangements should reflect that.
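
As a back-of-the-envelope sketch of that arithmetic, using the 9:30 am to 9:30 pm window, week-long shifts, and 5-person rotation described above (the 7-day week is my assumption; the thread only says "a week long"):

```python
from datetime import datetime

# Figures from the thread: 9:30 am -> 9:30 pm window, 5 people in rotation.
window_start = datetime(2020, 1, 6, 9, 30)   # the date itself is arbitrary
window_end = datetime(2020, 1, 6, 21, 30)
rotation_size = 5

# Assumption: "a week long" means all 7 days, weekends included.
days_per_shift = 7

hours_per_day = (window_end - window_start).total_seconds() / 3600
hours_per_shift = hours_per_day * days_per_shift
avg_hours_per_week = hours_per_shift / rotation_size  # averaged over the rotation

print(hours_per_day)       # 12.0
print(hours_per_shift)     # 84.0
print(avg_hours_per_week)  # 16.8
```

With a 5 minute response time, those 84 hours are all hours in which you effectively cannot leave your laptop.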

I suspect that if the company is currently getting that amount of extra work (over and above a normal length working day) for free, then you're unlikely to be able to get them to change that. If it was me, I'd be looking for a role in another team or company that has a more realistic approach to on-call.

Any potential extra impact on your current colleagues that you leaving might cause is the responsibility of your management and up to them to mitigate. How your current colleagues decide to react to the on-call situation should be up to them.

Good luck resolving this, I've been in work situations that had unreasonable expectations myself and I appreciate how stressful it can be.

> If your expected response time is of the order of 5 minutes, then you are not "on-call", you are working 12 hour days

I'm a new dev at a fairly young startup. We have recently started an oncall process and we have similar response times for oncall though our workload isn't nearly as heavy since our scale is low. What's the standard in oncall response times/expectations?

I don't think there's any real standard, since it very much depends on application SLAs, industry sector, size of the on-call team, geographic location, length of on-call rotation, frequency of call-out, how realistic the management are, how much inconvenience the team members are willing to put up with, etc.

For example, I work in London and it would be unreasonable to expect that someone could travel between home and work on public transport and still meet a response SLA less than one hour. That would likely be a different length of time in another location, or if people worked 100% remotely, for example.

My opinion is that if you have a response time less than say 30 minutes, then you actually need to be compensating people for sitting in front of their computers ready to respond immediately, whether that be in the office or remotely.

Unless call-outs are very frequent (in which case there are underlying reliability, capacity management, and/or alerting issues which need to be resolved), then on-call isn't really about the extra time spent working, but the restrictions on what one can do whilst on-call.

To use a fairly simple metric: if an on-call SLA means that I have to be concerned about whether I can pop out to a local shop or how long I can spend in the shower, then I don't think that I would be on-call, I would be working.

Of course, startup environments (especially early-stage ones) are always different from more corporate environments, and there are generally greater resource constraints. For a startup I am usually looking more at what valuable experience I can gain, rather than maximising remuneration (subject to a certain base level, of course).

However ultimately the question remains the same: do I think that what I am getting out of this role is worth what I have to put into it? There are probably roles in which I'd be willing to put up with the inconvenience of very short on-call SLAs, because either they paid very well, or I was gaining very valuable experience.

Whether a role fulfils one's own expectations for the reward/expenditure ratio is something that everyone has to decide for themselves.

Wait, so you have to be available within 5 minutes between 9:30AM -> 9:30PM for a whole week?

I hope you are getting paid a lot. What happens if you get paged while taking a shit? Do you get reprimanded if you can’t get off the toilet in under 5 minutes?