by danpozmanter 2160 days ago
"Engineers should be on call for their own code." - Would you rather work someplace you are expected to be on call 24/7, or a company that doesn't require that?

It isn't the norm, and it isn't competitive. It's just more "always on" culture in the workplace - and that's not healthy. A company should understand workers need real breaks - and being on call is not a real break.

7 comments

Want to jump in here - I have worked at a company where engineers are not on call for their code, and it was a living nightmare.

_You_ might not be on call for your code, but _somebody_ will be. Often some poor SRE/ops person who has absolutely no idea what the app is doing or why it's failing in production.

Not being on-call makes engineers complicit. I've seen it all: known memory leaks shipped into production, apps where half the endpoints couldn't even be compiled, code dumping the production Redis at 1AM ... and every time the pain just fell on deaf ears.

If your code is what wakes you up in the middle of the night, you have:
- Incentive to fix/mitigate as soon as possible.
- No blame game to play. Either the error was made by you, or someone on your team. It doesn't have to go up 3 rungs on the ladder then back down again.

I don't think the author was suggesting that everyone should always be on call, just that you _must_ be responsible for your own code in production.

I'm always happy to help some poor SRE in the middle of the night, and I once even drove to the office on a rainy Sunday, in the middle of my vacation, to access IP-restricted stuff because a support intern messaged me on Instagram.

...but with that said: I'm glad I only worked in countries where work is properly regulated and "on call" means "I'm getting fucking paid every cent for each hour I _must_ answer that goddamn phone". Which in practice means there's no PagerDuty.

The unpaid on-call culture is bullshit. The company can either pay me or go fuck itself.

I unfortunately work in a place where on-call is unpaid. I'm an SRE stuck in the 90s.

The policy states that only the Operations team gets paid on-call, because I guess in the old days they would be the ones expected to deal with production.

Fast forward to today, and the Operations folks are a small team managing 2 datacentres, and all on-call rotations between SREs and developers are considered unofficial and therefore not eligible to be paid.

One of our Sr. Managers tried to take this up the chain, but then got reprimanded for putting developers on-call.

Now apply the same rules to the SRE role as well.

I've found the problem isn't being on call, it's being on call PLUS not having control over priorities.

When doing it right, owning the on call experience can be a valuable learning experience. But this usually means having extra time to do things like develop integration and performance testing environments. And access to make whatever changes needed to happen, happen.

But a lot of places are like "nah, you're being a perfectionist". And then expect you to magically respond to issues with vague descriptions and no diagnostics. And yeah, that sucks.

When you're on call you aren't on call 24/7 forever. It usually rotates amongst the engineers on the team. I'm on a team with about 10 engineers in it, so you're on call about a week every two months. I call that manageable.

Engineers should 100% be responsible for owning their code, and fixing any issues that arise from it. After all, they're the ones who wrote it; aren't they the best people to fix it when it breaks?
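
For what it's worth, the rotation arithmetic is easy to sanity-check. Here is a minimal sketch, assuming a plain round-robin with one-week shifts; the function name and parameters are made up for illustration and are not anyone's actual scheduling tool.

```python
def on_call_load(team_size: int, shift_weeks: int = 1) -> tuple[int, float]:
    """Return (weeks in one full rotation cycle, fraction of time each person is on call).

    Assumes a simple round-robin where every engineer takes one shift per cycle.
    """
    if team_size < 1 or shift_weeks < 1:
        raise ValueError("team_size and shift_weeks must be at least 1")
    cycle_weeks = team_size * shift_weeks
    return cycle_weeks, shift_weeks / cycle_weeks


cycle, fraction = on_call_load(10)
print(cycle, fraction)  # 10 0.1
```

Under those assumptions a team of 10 puts each person on call one week in every ten, which is in the same ballpark as "about a week every two months". The load climbs quickly as the team shrinks: the same function gives one week in three for a team of 3.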

I don't imagine Charity (article author) was implying that the Amazon method of being on-call for your code was ideal.

I took it to mean that code ownership is important and you should be responsible for fixing things when your code blows up.

I would argue that using the right words is important here. “On call” typically refers to a specific thing, where if something breaks that person on call is, well, called (or paged, etc.) to fix it when it happens, even at 2am. This is how I took the meaning of that paragraph as well.

If the author meant that code ownership is important, or that the engineer(s) who wrote the code are responsible, that message could have been conveyed by saying, “the persons who wrote the code should test it once it’s deployed and are responsible for fixing if it breaks or is broken when deployed.” This is much clearer and doesn’t use terms that could be understood incorrectly.

Being on call sucks, clearly. But the benefit of engineers being on call for their code is that it makes a pretty effective feedback loop --- the person who breaks the thing fixes the thing, and learns to break the thing less often, or to break it earlier in the business day so as not to ruin their evening, or to make it run better in degraded modes so it's ok to be broken for longer and alerts can be acknowledged and dealt with later.

I like working in small teams because there's less required communication. Having the oncall be the engineer means the oncall doesn't have to communicate with the engineer, they're always up to date because it's one person (subject to sleep deprivation issues).

It's certainly not good for work/life balance though. Some production issues are unavoidable; automation for the common ones can help.

Edit to add: if you're oncall and your alerts are mostly because your dependencies are bad at their job, and you aren't empowered to do anything about that, having the engineer oncall isn't useful. It's only useful if the engineer is in a place to make changes to reduce future alerts.

And then that one person gets hit by a bus and you go out of business. Very-interconnected large-scale systems rarely have failure modes that are as simple as something the dev did/didn't do.

It seems like about half of the postmortems I've seen (public ones for high profile things and private ones where I've worked) have the incident start either when someone pushed a change, or sometime after the change was pushed when the change blew up; this is why change moratoriums are so effective --- when people stop messing with the system, it becomes stable.

Another large portion is power transfer switches failing. Then you have redundant Cisco products failing to fail over properly, often resulting in 30 seconds to 5 minutes of lost network connectivity and then (if you're reading a postmortem) cascading failures. After that it's one-off partial hardware failures where things worked enough to meet healthchecks but not enough to do actual work (my favorites are things like ECC correcting errors at such a high rate that the system is using 90%+ CPU on servicing machine check exceptions, or the system somehow booting with 64 MB of RAM instead of 4 GB and running from swap, miraculously).

You can obsess about bus factor, or you can hire people who are good at figuring out complex systems with no documentation and if someone leaves, assign someone with good overall system knowledge to their system until you can find a new dedicated person.

Arguing in favor of more than one person per project is not "obsessing" over bus factor lol. I want to be able to take days off, and I want my coworkers to enjoy the same.

The kind of takeaway I'd want to see from your first example is less like "don't do the things we know will cause breakage when we can't tolerate breakage" and more like "develop runtime-gating of new features and a way of sampling or shadowing production traffic onto n+1 builds before they are eligible to become the released build".
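
To make "runtime-gating of new features" a bit more concrete, here is a minimal sketch of a percentage rollout gate. It assumes users are bucketed by hashing a feature name plus a user id; `in_rollout` and its parameters are hypothetical names, not any particular feature-flag library, and it does not cover the traffic-shadowing half of the idea.

```python
import hashlib


def in_rollout(feature: str, user_id: str, percent: float) -> bool:
    """Decide whether a user sees a gated feature at the given rollout percentage.

    Hashing feature + user id gives each user a stable bucket, so the same
    user gets the same answer on every request, and raising `percent` only
    ever adds users to the rollout.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket in 0..9999
    return bucket < percent * 100


# Dial the new code path up or down at runtime instead of redeploying.
if in_rollout("new-checkout", "user-42", percent=5):
    pass  # serve the new code path
else:
    pass  # serve the existing code path
```

The point of the design is that turning a misbehaving feature off becomes a config change (set `percent` to 0) rather than an emergency rollback.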

I've also had many issues with dodgy hardware of all types forever-circling repair queues in large fleets and never had a satisfying outcome for it either. Hopefully one of these days.

The whole point of putting engineers on call (in this context) is to encourage them to make good technology choices and to take some ownership for their products. If there was no counter-pressure with pager duty or other threats to peaceful existence, then most developers would just pick whatever technology they personally enjoy using the most and expect that someone else will fix their special shitpile for them at 3am. Someone is always going to get screwed in this equation, at least make it an equitable screwing.

Being on-call doesn't just apply to code either. Would you be OK if no one tried to fix your broken water pipes or electricity until the following business day? Do we turn off the global internet at bed time?

At some point people are going to have to do shitty work to keep this world running. The best you can do is rotate the shitty work around so that everyone can help out. Automate what you can, share the load for what you cannot. If everyone does their part, it is a lot less painful all around.

There's more going on and worth exploring if engineers are so unattached to outcomes they pick technology in a vacuum. Pager duty is a heavy stick - it's important not to avoid root cause analysis. So if your engineers are making bad choices - what's really going on?

That assumes engineers are even empowered to make technology choices. At many companies they are not (whether by dint of organizational structure or the roadmap not allowing a major technology shift from whatever "shitpile" you and your team have inherited).

Clear escalation strategies (and knowing when escalation to the original engineers behind a project is even appropriate) are often lacking. I wouldn't want to call engineers in at 3am for a problem that can be fixed by following a documented devops process. Plus - what happens when the engineer you need to reach is unavailable? They are sick, or don't wake up, or their phone died?

What happens when business pressure says "we're ok with calling engineers twice a week as long as the roadmap moves"?

"You built it you're on call" is a fragile way to handle problems in more ways than one.

Which isn't to say there shouldn't be shared responsibility. Of course there should. But responsibility without power is toxic. At the very least it increases flight risk - but in practice often has a far wider reaching deleterious effect than just that.

> I wouldn't want to call engineers in at 3am for a problem that can be fixed by following a documented devops process.

Why would there be a process that could be executed that wouldn't already be automated? If the ops guy is dealing with an issue, it's because all the known remediations have failed.

Containers already have auto restart on failed health checks. VMs have vMotion and HA for failed hardware. If the ops guy is up at 3am, dealing with a service you wrote, chances are high that you (or your team) should be involved for the quickest resolution.

A documented process can be automated.

Humans should be paged only when this is a new category of failure, and in that case, having the developer wake up first triggers a really good feedback loop.
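
A rough sketch of that routing rule, assuming each known failure category maps to an automated remediation and anything else goes to a human. `handle_alert`, the `runbooks` mapping, and the `page` callback are all hypothetical names for illustration, not a real alerting API.

```python
from typing import Callable, Dict

# A remediation takes a service name and reports whether the fix worked.
Remediation = Callable[[str], bool]


def handle_alert(
    category: str,
    service: str,
    runbooks: Dict[str, Remediation],
    page: Callable[[str], None],
) -> str:
    """Try the automated fix for a known failure category; page a human otherwise."""
    fix = runbooks.get(category)
    if fix is not None and fix(service):
        return "remediated"
    # Either a new category of failure or the known remediation did not work.
    page(f"{service}: {category}")
    return "paged"
```

Every page that goes out this way is a candidate for a new entry in `runbooks`, which is where the feedback loop comes from.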

I think the model that I prefer is: "if you're going to deploy some code in the evenings or at the beginning of the weekend, you're on call for the code you added until the start of the next work day"

And have it be only around the code you just added.

If you deploy (and validate) earlier in the day or not on Friday (or whatever the end-of-week day is) then that requirement is gone.

I've definitely seen code get deployed at 8pm on a Friday night, and who knows whether it was guaranteed to work. That person definitely should be on call for it.
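
Written down as a rule, that might look something like the sketch below. The specifics are my assumptions, not part of the model as stated: a Monday to Friday work week, a 09:00 start, a 17:00 cutoff for "evening", and the name `on_call_until`.

```python
from datetime import datetime, time, timedelta
from typing import Optional

WORKDAY_START = time(9, 0)    # assumed start of the work day
EVENING_CUTOFF = time(17, 0)  # assumed point after which a deploy counts as "evening"


def on_call_until(deployed_at: datetime) -> Optional[datetime]:
    """Return when the deployer's on-call duty ends, or None if none applies."""
    is_weekend = deployed_at.weekday() >= 5  # Saturday or Sunday
    if not is_weekend and deployed_at.time() < WORKDAY_START:
        # Deployed in the small hours of a work day: covered until that morning.
        return datetime.combine(deployed_at.date(), WORKDAY_START)
    if not is_weekend and deployed_at.time() < EVENING_CUTOFF:
        return None  # deployed during working hours: no extra duty
    # Evening or weekend deploy: covered until the next work day starts.
    day = deployed_at.date() + timedelta(days=1)
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return datetime.combine(day, WORKDAY_START)


print(on_call_until(datetime(2024, 3, 1, 20, 0)))  # Friday 8pm -> 2024-03-04 09:00:00
print(on_call_until(datetime(2024, 3, 5, 11, 0)))  # Tuesday 11am -> None
```

Scoping the duty to "only the code you just added" is harder to encode and is left out of the sketch.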