by jrockway 1991 days ago
At big companies, it should be possible to have enough engineering offices to staff oncall with normal 8 hour days. (8 hours in Tokyo, 8 hours in London, 8 hours in San Francisco, or similar.)

At startups, it's harder; you simply can't have 3 dev teams on 3 continents, so someone is going to have to be around at night. The balance we found at the company I work at is that you are oncall Thursday-Thursday and get Friday off. (Not "not oncall", but "don't show up".) This seems fair to me; the free vacation day is really nice! (We're hiring! https://pachyderm.io/careers/)

When I was at Google, I worked on Fiber and we didn't have a dev presence around the world, so we had to be oncall after hours. We had a dedicated operations team with people paid to be at work during strange hours, as you'd expect from an ISP, but some issues were escalated to the dev team, and we had to be around for those. I was also the TL for a monitoring system that informed operations of outages, so my team would need to be around to handle monitoring monitoring ;) We just got paid extra for every hour we were oncall; I remember it being something like $1600 per week, but I forget the exact number. I was happy with this arrangement. Other people weren't, and weren't asked to be oncall, and it didn't count against them in any way. It all seemed fair to me.

2 comments

> At big companies, it should be possible to have enough engineering offices to staff oncall with normal 8 hour days. (8 hours in Tokyo, 8 hours in London, 8 hours in San Francisco, or similar.)

That arrangement is commonly known as "follow the sun support" (not particularly at you, it's just a good piece of jargon to know).
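
For anyone who hasn't seen it in practice, here is a minimal sketch of how a pager could be routed under follow the sun. The shift table is an assumption for illustration: it simply splits the UTC day into three 8-hour blocks and assigns them to the three cities from the parent comment, which is not how any particular company actually draws the boundaries.

```python
from datetime import datetime, timezone

# Hypothetical shift table: each region covers one 8-hour block of the UTC day.
# Real rotations would pick boundaries that match local office hours.
SHIFTS = [
    (0, 8, "Tokyo"),
    (8, 16, "London"),
    (16, 24, "San Francisco"),
]


def oncall_region(now: datetime) -> str:
    """Return the region whose shift covers the given timezone-aware moment."""
    if now.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    hour = now.astimezone(timezone.utc).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise ValueError("shift table does not cover the full day")


if __name__ == "__main__":
    print(oncall_region(datetime.now(timezone.utc)))
```

The handoff itself is the easy part; the table says nothing about how context gets passed between regions.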

It comes with its own set of issues. I work on a team that does follow the sun support, and while it's great for handling ops issues, it makes dev work much harder. We're not a large team, so it evens out to 2 people per region (one of whom is always "on-call" and can't do dev work). The communication costs from time zones are real, and it makes everyone's context on what is going on different because they see updates from different regions.

> I remember it being something like $1600 per week, but I forget the exact number

No wonder, that's a pretty generous on-call stipend. I've worked places that paid for on-call, but never that well. It was usually more of a token amount, like $200 or $400 for the week. I.e. far less than it would be if you were paid your hourly wage (averaged from salary).
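
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes a 40-hour work week, so the stipend is spread over the remaining 128 hours of the week; that split is my assumption, not something either employer stated.

```python
HOURS_PER_WEEK = 7 * 24                      # 168 hours in a week
WORKING_HOURS = 40                           # assumption: a standard 40-hour week
OFF_HOURS = HOURS_PER_WEEK - WORKING_HOURS   # 128 hours covered only by on-call


def stipend_per_off_hour(weekly_stipend: float) -> float:
    """Spread a weekly on-call stipend across the hours outside normal work."""
    return weekly_stipend / OFF_HOURS


if __name__ == "__main__":
    for stipend in (200, 400, 1600):
        rate = stipend_per_off_hour(stipend)
        print(f"${stipend}/week is ${rate:.2f} per off-hour")
```

Under that assumption even the $1600 figure works out to $12.50 per off-hour, and the token amounts come to a few dollars at most.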

Overall, I think "follow the sun" is a great idea for teams that are generally not forward looking. It's hard to communicate on forward looking projects, but it's easy to hand over operational issues. I would absolutely do it for a NOC-type team, but I would have to think about doing it for a dev team that needs to handle after-hours issues.

Usually follow the sun support doesn't mean devs around the world working on the same things. That would be harder to coordinate for sure. A big enough company is probably better off hiring SREs instead of having dev teams where every meeting time is bad for someone. And a small enough company is probably better off outsourcing 24 hour coverage.

Why can't the on call people do dev work though? Having someone on call and on deadline isn't realistic. But there should be time between on call issues. And in most code bases there are things someone can work on without coordinating with the rest of the team every day.

Google does hire SREs actually. It doesn't mean dev teams aren't on call also. But dev teams do get paged much less frequently.

> The balance we found at the company I work at is that you are oncall Thursday-Thursday and get Friday off

If I'm reading this correctly, the expectation is that the on-call person doesn't sleep for a week, and that a single day off work is fair compensation.

I think you're being pretty disingenuous here. It is very unusual for anything to come up overnight and being paged is the last resort. If there was a sleepless night, we'd make sure you weren't oncall anymore that week. Oncall is about being available to be called within a certain amount of time, not being awake to keep an eye on things. Software keeps an eye on things.

Nobody thinks a 24 hour oncall rotation is optimal, which is why companies distribute themselves throughout the world and simply have people at work somewhere 24 hours a day. But even at companies like Google, it's not always possible. You have to balance working on a small team and moving quickly versus having triple dev-team redundancy.

Some other workarounds are:

1) Hire someone to be awake during off hours. They won't be around with the rest of the team, so probably won't have the same understanding of the service that they are responsible for supporting. Personally, I've never seen this work well -- both teams see each other as "out of sight, out of mind" and don't really help each other.

2) Ignore all issues between 5PM and 9AM. This is quite possible to do, and might be the right thing for certain companies.

3) Hope nothing bad happens, and when something bad does happen, call everyone on the team frantically hoping someone will be awake and answer your call.

Like I said, I'm happy with the balance I have at work. I think it gives our customers the confidence they need to trust us, while giving engineers a decent work-life balance. I'm just some random engineer; I didn't start this company or force this upon others. I chose it for myself.

I shared my experience because I think it's relatively unique (with the OP's experience of mandatory uncompensated work the norm), and I like it.

> It is very unusual for anything to come up overnight and being paged is the last resort.

This does not jibe with my experience. Most companies aren't Google as you've described it, and in most cases the person on pager duty is the first human examining the incident.

> If there was a sleepless night

How about if they get woken up each night for an alarm that turns out to not be a big deal? That is the typical on-call experience, getting woken up for 15-30 minutes each night, cortisol from 0 to 100 in the 15 seconds it takes to get into Work Mode.

I guess that doesn't qualify as "sleepless" but between it and the general stress of not being able to turn the phone on silent, I'd call it "shit sleep." Nobody should be subject to it. How can you expect somebody to produce decent software in this condition?

> That is the typical on-call experience, getting woken up for 15-30 minutes each night

There is no "typical on-call experience". Some teams have an on-call rotation that goes a year without being used. Some oncalls get paged once a week and it's a serious issue that will take an hour or two to resolve. Some oncalls are impossible to handle, with alerts every few hours.

How does your experience at some other company help anyone understand this guy's company? It makes no sense as a response.

This is like when I said I once had unlimited vacation and I took eight weeks off a year for three years and people were like "That's not my experience". Okay, well, sucks for you. No one can do anything with that.

> That is the typical on-call experience, getting woken up for 15-30 minutes each night, cortisol from 0 to 100 in the 15 seconds it takes to get into Work Mode.

The only company I ever had that happen with was the big company. The other two companies I've worked with that had on-call experiences, if anything like that happened, we would be tweaking alarm levels so it didn't happen anymore.

If you're not tweaking alarm levels or fixing code to clear out false alarms, it's not a sustainable on-call rotation and that needs to be fixed immediately.

I've been the solitary on-call for the main service of a company before and I almost never got called because 1) we had good KB articles for the operations center for when things did break, and 2) things very rarely broke in a way that wasn't automatically fixable.

It's amazing in how many cases "remove the broken machine from the pool automatically, restart the service, and bring the machine back when the service crashes" is a valid fix for the weird, extra edge case junk that would otherwise be a call.
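
A minimal sketch of that kind of automatic fix, with the caveat that every callable here (`is_healthy`, `remove_from_pool`, `restart_service`, `add_to_pool`, `page_human`) is a hypothetical hook standing in for whatever your load balancer, init system, and paging tool actually expose:

```python
from typing import Callable


def remediate(
    machine: str,
    is_healthy: Callable[[str], bool],
    remove_from_pool: Callable[[str], None],
    restart_service: Callable[[str], None],
    add_to_pool: Callable[[str], None],
    page_human: Callable[[str], None],
) -> bool:
    """Try the automatic fix first. Returns True if nobody had to be paged."""
    if is_healthy(machine):
        return True
    # Stop sending traffic to the machine before touching it.
    remove_from_pool(machine)
    restart_service(machine)
    if is_healthy(machine):
        # The restart worked, so put the machine back into service.
        add_to_pool(machine)
        return True
    # The machine stays out of the pool and a person takes over.
    page_human(machine)
    return False
```

The point is only the ordering: the page is the last step, after the cheap fix has already been tried and has failed.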

I've experienced this at small startups and BigCorps. Granted, at the BigCorps fewer things blew up in general, and when they did, it was interesting.

> How about if they get woken up each night for an alarm that turns out to not be a big deal?

Write a post-mortem on this "crying wolf" pattern. It is definitely a bug in your alerting rules, so action has to be taken; otherwise others will routinely ignore important alerts.

On my teams, we've come up with this: If you get woken up in the middle of the night for an actual incident (longer than 15 minutes, after 10 PM), you can take the following morning off (half day) to catch up on your rest. If the incident takes longer than an hour to resolve (after 10 PM and before 7 AM), you can take the following day off once you have your paperwork in order.
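
Roughly, the rule could be written down like this. It is a sketch, not our actual tooling: it assumes both thresholds are measured against the same 10 PM to 7 AM window, counts only the minutes of the incident that fall inside that window, and expects naive datetimes in the engineer's local time.

```python
from datetime import datetime, time, timedelta

NIGHT_START = time(22, 0)  # 10 PM
NIGHT_END = time(7, 0)     # 7 AM the next morning


def night_minutes(start: datetime, end: datetime) -> float:
    """Minutes of the incident that overlap any 10 PM to 7 AM window."""
    total = timedelta()
    # Start one day early so an incident at 2 AM matches the previous night's window.
    day = start.date() - timedelta(days=1)
    while day <= end.date():
        win_start = datetime.combine(day, NIGHT_START)
        win_end = datetime.combine(day + timedelta(days=1), NIGHT_END)
        overlap = min(end, win_end) - max(start, win_start)
        if overlap > timedelta():
            total += overlap
        day += timedelta(days=1)
    return total.total_seconds() / 60


def time_off(start: datetime, end: datetime) -> str:
    """Map an incident to the time off it earns under the rule above."""
    minutes = night_minutes(start, end)
    if minutes > 60:
        return "full day"
    if minutes > 15:
        return "half day"
    return "none"
```

For example, `time_off(datetime(2024, 1, 1, 23, 0), datetime(2024, 1, 1, 23, 30))` comes out as a half day, since 30 minutes fall inside the night window.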