by nrr 3807 days ago
… whereas I'd conversely argue that volumes are also spoken when a company does not require its developers to be on-call. Often, those volumes are written with negative undertones and a narrative that speaks to developers not owning or understanding their software stack. Worse, in an exercise reminiscent of companies offering unlimited PTO, "not on-call," when read between the lines, could really mean "always on-call." I currently work with teams that operate this way, and the amount of burnout is staggering.

As someone who has leapt between development and operations time and time again, I can anecdotally say that the systems where developers felt like they had an ownership stake (including a desire to know "what went wrong," often only accurately conveyed via an OC shift) were better designed, better maintained, had longer MTBF, and had dramatically shorter MTTR.
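
For anyone who wants to put numbers on that rather than take my anecdote for it, here is a minimal sketch, assuming you keep an incident log of (outage start, service restored) timestamps. The log entries and function names are made up for illustration. It treats mean time between failures as the average uptime between outages and mean time to repair as the average outage length, which is one common convention and not the only one.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (outage start, service restored) pairs.
incidents = [
    (datetime(2024, 1, 3, 2, 0), datetime(2024, 1, 3, 2, 30)),
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 15, 0)),
    (datetime(2024, 1, 24, 9, 0), datetime(2024, 1, 24, 9, 30)),
]


def mean_time_to_repair(incidents):
    """Average outage duration: total downtime divided by incident count."""
    downtime = sum((end - start for start, end in incidents), timedelta())
    return downtime / len(incidents)


def mean_time_between_failures(incidents):
    """Average uptime from the end of one outage to the start of the next."""
    ordered = sorted(incidents)
    gaps = [nxt[0] - prev[1] for prev, nxt in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


print("time to repair:", mean_time_to_repair(incidents))
print("time between failures:", mean_time_between_failures(incidents))
```

Tracking both per service, before and after developers join the rotation, would be one way to check whether the ownership effect shows up in your own shop.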

That doesn't mean that the development team that engineered the system in the first place should always be the first line of defense; rather, it means that there should still be some sort of escalation tier in case the operations team to which the service has been handed off needs some further assistance. I can recount stories for days wherein I was frustrated as someone in operations that I couldn't reach someone from a development team in the middle of the night for a critical service--where VP- and SVP-levels were barking at me to fix it--only to hijack repository permissions and write patches myself for their software. It should go without saying, naturally, that reprimands for such heroic acts (heaven forbid I actually fix the damn thing) were definitely forthcoming the next morning.

That whole mess is wrong and anti-collaborative. Without the guardrails of a well-defined on-call shift, this is what no-on-call organizations devolve to.

This all comes with a caveat though: Again, anecdotally, I've found that tossing the development team to the fire for the first six or so months of a service's production lifetime, before even allowing them to ask for operations handoff, pays dividends in terms of meeting (or exceeding) business goals, keeping operations' frustrations low, and delivering a quality service that other teams can rely on. This goes in concert with automated unit/functional/whatever testing, knocking parts over in production in a controlled manner, continuous reviews of documentation, commit-gating code reviews, monitoring that makes sense, and so on and so forth.

(As an aside, folks on the operations side have enough to worry about in terms of integrating all of the infrastructure to make everything appear to Just Work™. Adding the burden of having to reverse engineer tossed-over-the-wall "It's an ops problem now" garbage is akin to the trend in the initial days of the devops movement for operations folks to toss systems automation over the wall to developers. It's disrespectful. Work together. Show trust and solidarity by carrying a pager alongside the ops guys to say, "Yes, I'm right here with you in case you need me.")

Tom Limoncelli somewhat recently put out "The Practice of Cloud System Administration"[0], and I implore you to give it a cover-to-cover read. Even folks who align with the development side of the house will get some benefit from it.

Finally, to your point about testing in a non-production environment: Even with a barrage of testing in staging or QA, you will still find problems in your software that only exist in production, and it won't be for lack of trying to unify fiddly things like configuration parameters or runtime versions.

0: http://the-cloud-book.com/

4 comments

> Worse, in an exercise reminiscent of companies offering unlimited PTO, "not on-call," when read between the lines, could really mean "always on-call." I currently work with teams that operate this way, and the amount of burnout is staggering.

I've found the best way to screen for those sorts of problems is to ask everyone in the interview about their personal habits, instead of about process. Don't ask about vacation policy; ask when your interviewer last went on vacation. Don't ask about the on-call rotation; ask about what issues came up outside work hours in the last month.

I don't disagree with anything you said here. Organizations should try to hire developers that care about what they build and understand that, as an implied part of the nature of the job, your phone may ring at some unknown time because of an issue. In operations it is certainly implied that you will likely be the first one notified of production issues and you will most likely be the first to know which developer(s) need to be contacted. I am certainly not advocating for a developer to throw their hands up and say "not my problem, I'm not on call" or "it's an ops issue." Those would be very junior or childish reactions and certainly not the qualities of a senior developer.

What my post was getting at was more of an observation I have made during interviews where teams had on-call rotations for developers. When I asked "how often does your phone ring during your time on-call," I would get answers that hinted at a deeper issue. Maybe it wasn't always that way, maybe the on-call rotation started with the purest of intentions like you have highlighted, but somewhere along the line management saw it as an opportunity to take shortcuts with testing and quality. So, am I saying that if a company made me an excellent offer but required on-call rotations, it would automatically be ruled out? No. I am, however, going to be asking some very pointed questions and probing that arrangement quite a bit.

Your points are valid. Stronger than that, I lean more toward your view than the parent's. Although I certainly understand the desire of the parent, I've been on both the Ops and Dev side of that fence, and on call REALLY SUCKS (I almost feel dirty in this post agreeing with a statement of "I should be on call," but it is what it is :) ). But it sucks worse if, as you say, it's entirely on the backs of the ops folks, who are also handling sysadmin-style work and are kept more isolated from dev, as is typical in "throw it over the wall" shops. (In writing this I realize I want to clarify that I have other thoughts entirely about having ops specialists instead of suggesting a merge to totally unified devops, but that's an entirely different discussion.) Also, as you say, ownership/responsibility/etc. is an ancillary benefit.

What I actually wanted to say from all this, however, is that despite the truth of your argument, I'd take a third angle. No on-call is bad, but similarly, _unpaid_ OT is bad. The common rhetoric from this soapbox is that if employers are accountable for this time, it'll incentivise them to have processes and hold values that don't abuse an expensive resource. If something is on fire, the on-call employee still _will be there to handle it_, but the fact that it's become a pro-bono, assumed part of many of our jobs in the tech sector is the part I take issue with more than doing the task itself.

This. tl;dr: No on-call schedule sometimes means always on-call.