Hacker News
by jmsduran 4149 days ago
It seems, especially for major corporations, that on-call/pager duty is quickly becoming the norm for software development teams. I do agree that pager duty is a symptom of a fundamental flaw within the system/architecture. I think it would be in a company's best interest to devote time to improving the reliability and stability of their infrastructure, instead of relying on the band-aid approach that pager duty seems to be.

Regarding #8 though, when you are pressured to resolve a complex issue within a short time window, it can absolutely induce a sense of panic in those who do not handle stress well. I believe the remedy for this would be to have two individuals designated as on-call at a time, assuming the team is large enough.

3 comments

> It seems, especially for major corporations, that on-call/pager duty is quickly becoming the norm for software development teams. I do agree that pager duty is a symptom of a fundamental flaw within the system/architecture. I think it would be in a company's best interest to devote time to improving the reliability and stability of their infrastructure, instead of relying on the band-aid approach that pager duty seems to be.

I can't see there ever being a time where there is no on-call requirement. You always need someone standing by in case of some terrible disaster that cannot be handled automatically. Better to have this as a formal responsibility that never gets used than to not have it and end up with extended downtime because you can't contact anyone.

That being said, if you're getting paged continuously during on-call, then there's a bigger problem that needs to be resolved.

> You always need someone standing by in case of some terrible disaster that cannot be handled automatically.

If it's a really terrible disaster, a once-a-decade kind of thing where everything goes haywire and you need as many staff as possible to get online ASAP, then yes. But aren't we talking more about the kinds of "disasters" that happen once a month or so, and can be handled by a few staff (not waking up the whole team)? To me that sounds more like just staffing for normal operations.

At large engineering companies this is typically handled via literally having someone standing by, i.e. formally on duty, rather than having off-duty employees be on pager duty. There'll be at least a bare-bones staff on the after-hours shift (probably not in all offices, but in some kind of 24/7 operations center), enough of a staff that reasonably foreseeable things can be handled. Of course there are some pros and cons to that from an employee perspective. On the one hand the night shift isn't that pleasant, but on the other hand your responsibilities are at least formally limited to 40 hours/wk; if you're on night shift one week, you don't come in during the day, or carry a pager during the day.

> and can be handled by a few staff (not waking up the whole team).

That's what this is though. With every setup I've seen there's a rotation of primary and secondary pagers for each team. When something breaks, the primary is paged; if they don't answer within a few minutes, the secondary is paged. If they need outside help they can page an individual person by name or just a team. For example, if I need help from a DBA, I page the DBA team and its primary is paged.
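
To make the escalation order concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `Rotation`, `escalate`, and the `acknowledged` callback are made-up names, not the API of any real paging product. In a real setup the callback would send the page and wait a few minutes for an answer.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Rotation:
    """Hypothetical on-call rotation for one team."""
    team: str
    responders: Sequence[str]  # ordered: primary first, then secondary


def escalate(rotation: Rotation,
             acknowledged: Callable[[str], bool]) -> Optional[str]:
    """Page responders in order and return the first one who acknowledges.

    `acknowledged` is an assumed callback standing in for "page this person
    and wait a few minutes for a response".
    """
    for person in rotation.responders:
        if acknowledged(person):
            return person
    return None  # nobody answered; a real system would escalate further


if __name__ == "__main__":
    dba = Rotation(team="dba", responders=["alice", "bob"])
    # Simulate the primary sleeping through the page.
    print(escalate(dba, lambda person: person == "bob"))
```

Paging "the DBA team" rather than a person is then just a lookup of that team's rotation followed by the same call.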

If you have 4-5 incidents a month this gives you a team available to handle any overnight issues without having to hire a bunch of people to twiddle their thumbs 90% of the time.

That seems pretty wasteful if emergencies are rare.

We have three people on-call on my team, and we typically have an issue at most once a month. So far, in 95% of cases, the issue can be resolved by killing an errant EC2 instance and waiting for its replacement to spin up in 5 minutes.

It would be much more disruptive and annoying if I had to work the graveyard shift even once every two months or so; aside from shifting my sleep schedule once every two months, it would be a week where I would probably be fairly unproductive.

This seems like a very naive response. We run on hardware whose lifetime is a matter not of whether it will fail, but of when. You don't know when that is, or how it will fail. The node could completely go away, or degrade enough that it begins to impact performance.

We also run persistent systems across the WAN. And, unfortunately, some of these things require the state to be maintained.

You can't just design these systems to be "better". There are often things outside of your control.

Based on your response, you seem to be the type of person causing pain for those with a pager.

Also, I'm sure the company that can make the Internet work every time, all the time, will make a killing.

Pager duty is not a band-aid. It CAN be, for poorly-managed companies, but even the most conscientious and knowledgeable company in the world is going to have unexpected failures.