by jeletonskelly 3810 days ago
To software developers in this thread who are on-call: I'd like to share some thoughts with you. I've worked at places that have on-call rotations and others that have none. I will no longer work at a company that requires me to be on-call.

Why? It says a lot when a company doesn't put the effort into various forms of testing and QA to ensure that production software does not have critical issues that warrant a 2am call. Unit, functional, integration, load, and simulation tests should be written for every single piece of critical infrastructure. You should be hammering these things in staging environments with 10-100x your normal peak load. Use something like Gore to replay live traffic against a version in staging or QA environments. Yes, that takes work, but to me it's better to do that work upfront than to wake me up in the middle of the night or to know that when I go home I have to have my phone around me at all times. The business should care about these things too; it's their product and they should care enough about you to make sure good processes are in place to ensure quality production software.
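
To make the replay idea concrete, here is a rough sketch of amplifying recorded traffic against a staging target. Everything in it is an assumption for illustration: `replay_amplified`, the `send` callable, and the request strings are hypothetical names, not the API of any real replay tool, and it sends requests one after another rather than generating true concurrent load.

```python
from typing import Callable, Dict, Iterable


def replay_amplified(
    recorded: Iterable[str],
    send: Callable[[str], bool],
    multiplier: int = 10,
) -> Dict[str, int]:
    """Replay each recorded request `multiplier` times and count failures.

    `send` is a hypothetical callable that delivers one request to the
    staging system and returns True on success.
    """
    sent = 0
    failed = 0
    for request in recorded:
        for _ in range(multiplier):
            sent += 1
            try:
                ok = send(request)
            except Exception:
                # A crash in the target counts as a failed request.
                ok = False
            if not ok:
                failed += 1
    return {"sent": sent, "failed": failed}
```

In practice `send` would wrap an HTTP client pointed at staging, and a real load test would also need concurrency and pacing to approximate 10-100x peak.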

That said, when I was at non-on-call companies there were definitely times when something happened that warranted immediate attention. Generally someone in operations would get the first call, check logs, diagnose the issue, and call a developer familiar with the app causing the issue if it came to that. I don't mind waking up in that case because I know it has to be something serious that slipped past our processes.

4 comments

Minimizing 2am issues, and maintaining an on-call rotation, aren't contradictory. There's no substitute for having someone on call; but you can minimize the number of times they actually get called. This topic is near and dear to my heart.

You talk about testing - that's one side of the coin; the other side is careful alert tuning, (a) to minimize false positives at 2 AM, (b) to catch incipient issues while it's still business hours. (It's useful to think of alerts as just another phase of QA - the one that occurs after you hit production. The sooner you notice a problem, the less damage it causes, to both your customers and your sleep schedule.)
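
As a minimal sketch of that tuning idea, assuming a simple three-level severity scheme and a 9-to-5 weekday schedule (both are my assumptions, as are the names `route_alert` and `in_business_hours`): only critical alerts page at night, while warnings wait for business hours.

```python
from datetime import datetime, time

# Assumed business hours; adjust to the team's actual schedule.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def in_business_hours(now: datetime) -> bool:
    """True on weekdays between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def route_alert(severity: str, now: datetime) -> str:
    """Decide what to do with an alert of the given severity.

    critical -> always page someone
    warning  -> notify during business hours, otherwise queue for the morning
    anything else -> just log it
    """
    if severity == "critical":
        return "page"
    if severity == "warning":
        return "notify" if in_business_hours(now) else "queue"
    return "log"
```

The point of the "queue" branch is that an incipient issue caught as a warning gets looked at during the day instead of waking anyone up.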

At my workplace, we run a fairly complex system, but we've been able to keep nighttime pager incidents down to, I think, less than once per quarter, including false alarms. I can't remember the last one. The QA effort isn't overwhelming, either. See http://blog.scalyr.com/2014/08/99-99-uptime-9-5-schedule/ if you're interested.

Honest question: have you ever managed to work at a company that's either successful or growing very fast and can guarantee "various forms of testing and QA to ensure that production software does not have critical issues that warrant a 2am call"?

I can imagine that being possible in consultancies or small scale/load products, but I've honestly never seen it on anything larger than that, including environments with a mind-blowing number of layers of QA and tests...

I've worked at a late-stage startup that serves billions of requests per day at peak load, and there was no on-call for developers. I believe the ops team did have an on-call, but that was more at the infrastructure level, as everything was self-hosted at colos.

This was done by having a simple and resilient serving architecture. Every server was stateless. In addition, all the complicated logic was pre-computed into immutable lookup tables, offline. So if that task fails, it doesn't cause downtime, and can wait until the next workday.
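
A rough sketch of that pattern, not the actual system: the names `build_table`, `Server`, and `try_refresh` are hypothetical, and the "complicated logic" is stood in for by a trivial doubling. The serving path only reads from a read-only table, and a failed offline rebuild leaves the previous table in place.

```python
from types import MappingProxyType
from typing import Callable, Mapping


def build_table(raw: Mapping[str, int]) -> Mapping[str, int]:
    """Offline step: run the expensive logic once and freeze the result."""
    computed = {key: value * 2 for key, value in raw.items()}  # placeholder logic
    return MappingProxyType(computed)  # read-only view of the computed dict


class Server:
    """Serving path: no per-request state, only lookups in the current table."""

    def __init__(self, table: Mapping[str, int]) -> None:
        self._table = table

    def handle(self, key: str, default: int = 0) -> int:
        return self._table.get(key, default)

    def try_refresh(self, builder: Callable[[], Mapping[str, int]]) -> bool:
        """Swap in a new table; on failure keep serving the old one."""
        try:
            new_table = builder()
        except Exception:
            return False  # stale data, but no downtime
        self._table = new_table
        return True
```

If the rebuild throws, `try_refresh` returns False and requests keep being answered from the old table, which is why the failure can wait until the next workday.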

We had a robust QA process, but it was far from stultifying.

Absolutely - not everything is a service (yet). It's not sexy, but if you only ship your bits every month or every two weeks then you shouldn't have 2am calls. Everything can wait til tomorrow (unless tomorrow you're shipping, in which case I have seen people still awake at 2am, but I consider that a failure).
I work at a successful avionics company writing software that goes onto airplanes. We do not have critical issues that warrant 2am calls.

I'm in the camp of "will not work at companies requiring on-call or pager-duty."

… whereas I'd conversely argue that volumes are also spoken when a company does not require its developers to be on-call. Often, those volumes are written with negative undertones and a narrative that speaks to developers not owning or understanding their software stack. Worse, in an exercise reminiscent of companies offering unlimited PTO, "not on-call," when read between the lines, could really mean "always on-call." I currently work with teams that operate this way, and the amount of burnout is staggering.

As someone who has leapt between development and operations time and time again, I can anecdotally say that the systems where developers felt like they had ownership stake (including a desire to know "what went wrong," often only accurately conveyed via an OC shift) were better designed, better maintained, had longer MTBF and had dramatically shorter MTBR.

That doesn't mean that the development team that engineered the system in the first place should always be the first line of defense; rather, it means that there should still be some sort of escalation tier in case the operations team to which the service has been handed off needs some further assistance. I can recount stories for days wherein I was frustrated as someone in operations that I couldn't reach someone from a development team in the middle of the night for a critical service--where VP- and SVP-levels were barking at me to fix it--only to hijack repository permissions and write patches myself for their software. It should go without saying, naturally, that reprimands for such heroic acts (heaven forbid I actually fix the damn thing) were definitely forthcoming the next morning.

That whole mess is wrong and anti-collaborative. Without the guardrails of a well-defined on-call shift, this is what no-on-call organizations devolve to.

This all comes with a caveat though: Again, anecdotally, I've found that tossing the development team to the fire for the first six or so months of a service's production lifetime, before even allowing them to ask for operations handoff, pays dividends in terms of meeting (or exceeding) business goals, keeping operations' frustrations low, and delivering a quality service that other teams can rely on. This goes in concert with automated unit/functional/whatever testing, knocking parts over in production in a controlled manner, continuous reviews of documentation, commit-gating code reviews, monitoring that makes sense, and so on and so forth.

(As an aside, folks on the operations side have enough to worry about in terms of integrating all of the infrastructure to make everything appear to Just Work™. Adding the burden of having to reverse engineer tossed-over-the-wall "It's an ops problem now" garbage is akin to the trend in the initial days of the devops movement for operations folks to toss systems automation over the wall to developers. It's disrespectful. Work together. Show trust and solidarity by carrying a pager alongside the ops guys to say, "Yes, I'm right here with you in case you need me.")

Tom Limoncelli somewhat recently put out "The Practice of Cloud System Administration"[0], and I implore you to give it a cover-to-cover read. Even folks who align with the development side of the house will get some benefit from it.

Finally, to your point about testing in a non-production environment: Even with a barrage of testing in staging or QA, you will still find problems in your software that only exist in production, and it won't be for lack of trying to unify fiddly things like configuration parameters or runtime versions.

0: http://the-cloud-book.com/

> Worse, in an exercise reminiscent of companies offering unlimited PTO, "not on-call," when read between the lines, could really mean "always on-call." I currently work with teams that operate this way, and the amount of burnout is staggering.

I've found the best way to screen for those sorts of problems is to ask everyone in the interview about their personal habits, instead of about process. Don't ask about vacation policy; ask when's the last time your interviewer went on vacation. Don't ask about the on-call rotation, ask about what issues came up outside work hours in the last month.

I don't disagree with anything you said here. Organizations should try to hire developers that care about what they build and understand that, as an implied part of the nature of the job, your phone may ring at some unknown time because of an issue. In operations it is certainly implied that you will likely be the first one notified of production issues and you will most likely be the first to know which developer(s) need to be contacted. I am certainly not advocating for a developer to throw their hands up and say "not my problem, I'm not on call" or "it's an ops issue." Those would be very junior or childish reactions and certainly not the qualities of a senior developer.

What my post was getting at was more of an observation I have made during interviews where teams had on-call rotations for developers. When I ask "how often does your phone ring during your time on-call" I would get answers that hinted at a deeper issue. Maybe it wasn't always that way, maybe the on-call rotation started with the purest of intentions like you have highlighted, but somewhere along the line management saw that as an opportunity to take shortcuts with testing and quality. So, am I saying that if a company were to make me an excellent offer but required on-call rotations, it would automatically be ruled out? No. I am, however, going to be asking some very pointed questions and probing that arrangement quite a bit.

Your points are valid. Stronger than that, I lean more toward your view than the parent's. Although I certainly understand the desire of the parent, I've been on both the Ops and Dev sides of that fence, and on call REALLY SUCKS (I almost feel dirty in this post agreeing with a statement of "I should be on call" but it is what it is :) ), but it sucks worse if, as you say, it's entirely on the backs of the ops folks, who are also handling sysadmin-style work and are kept more isolated from dev, as is typical in "throw it over the wall" shops. (In writing this I realize I want to clarify that I have other thoughts entirely about having ops specialists instead of suggesting a merge to totally unified devops, but that's an entirely other discussion.) Also, as you say, ownership/responsibility/etc. is an ancillary benefit.

What I actually wanted to say from all this, however, is that despite the truth of your argument, I'd take a third angle. No on call is bad, but similarly, _unpaid_ OT is bad. The common rhetoric from this soapbox is that if employers are accountable for this time, it'll incentivise them to have processes and hold values that don't abuse an expensive resource. If something is on fire, the on-call employee still _will be there to handle it_, but the fact that it's become a pro-bono assumed part of many of our jobs in the tech sector is the part I take issue with more than doing the task itself.

This. tl;dr: No on-call schedule sometimes means always on-call.

> Use something like Gore to replay live traffic against a version in staging or QA environments.

What is Gore?