by devicenull 4149 days ago
> It seems, especially for major corporations, that on-call/pager duty is quickly becoming the norm for software development teams. I do agree that pager duty is a symptom of a fundamental flaw within the system/architecture. I think it would be in a company's best interest to devote time in improving the reliability and stability of their infrastructure, instead of relying on the band-aid approach that pager duty seems to be.

I can't see there ever being a time where there is no on-call requirement. You always need someone standing by in case of some terrible disaster that cannot be handled automatically. Better to have this as a formal responsibility that never gets used than to not have it and end up with extended downtime because you can't contact anyone.

That being said, if you're getting paged continuously during on-call, then there's a bigger problem that needs to be resolved.

> You always need someone standing by in case of some terrible disaster that cannot be handled automatically.

If it's a really terrible disaster, a once-a-decade kind of thing where everything goes haywire and you need as many staff as possible to get online ASAP, then yes. But aren't we talking more about the kinds of "disasters" that happen once a month or so, and can be handled by a few staff (not waking up the whole team)? To me that sounds more like just staffing for normal operations.

At large engineering companies this is typically handled via literally having someone standing by, i.e. formally on duty, rather than having off-duty employees be on pager duty. There'll be at least a bare-bones staff on the after-hours shift (probably not in all offices, but in some kind of 24/7 operations center), enough of a staff that reasonably foreseeable things can be handled. Of course there are some pros and cons to that from an employee perspective. On the one hand the night shift isn't that pleasant, but on the other hand your responsibilities are at least formally limited to 40 hours/wk; if you're on night shift one week, you don't come in during the day, or carry a pager during the day.

> and can be handled by a few staff (not waking up the whole team).

That's what this is, though. In every setup I've seen, each team has a rotation of primary and secondary pagers. When something breaks, the primary is paged; if they don't answer within a few minutes, the secondary is paged. If they need outside help, they can page an individual person by name or page a whole team. For example, if I need help from a DBA, I page the DBA team and its primary is paged.
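
Roughly, the primary/secondary escalation I'm describing looks like the sketch below. To be clear, this is a toy model and not any real paging product's API: `Rotation`, `Pager`, and `escalate` are names I made up, and the "no answer within a few minutes" timeout is reduced to `page()` simply returning False.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Rotation:
    """One team's on-call rotation (hypothetical structure)."""
    team: str
    primary: str
    secondary: str


@dataclass
class Pager:
    """Stand-in for a paging system; not a real service's API."""
    # Assumption: people in this set acknowledge a page before the timeout.
    responsive: set = field(default_factory=set)
    # Everyone who was paged, in order.
    log: list = field(default_factory=list)

    def page(self, person: str) -> bool:
        """Record the page and report whether it was acknowledged in time."""
        self.log.append(person)
        return person in self.responsive


def escalate(rotation: Rotation, pager: Pager) -> Optional[str]:
    """Page the primary first, then the secondary if the primary does not answer.

    Returns whoever acknowledged, or None if nobody did.
    """
    for person in (rotation.primary, rotation.secondary):
        if pager.page(person):
            return person
    return None


# Paging a team rather than a person: the team's rotation decides who gets woken up.
dba = Rotation(team="dba", primary="alice", secondary="bob")
pager = Pager(responsive={"bob"})  # alice sleeps through it
print(escalate(dba, pager))  # bob
print(pager.log)             # ['alice', 'bob']
```

The point is that only one or two people per team are ever paged for a given incident, not the whole team.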

If you have 4-5 incidents a month, this gives you a team available to handle any overnight issues without having to hire a bunch of people to twiddle their thumbs 90% of the time.

That seems pretty wasteful if emergencies are rare.

We have three people on-call on my team, and we typically have an issue at most once a month. So far, in 95% of cases, the issue can be resolved by killing an errant EC2 instance and waiting for its replacement to spin up in 5 minutes.

It would be much more disruptive and annoying if I had to work the graveyard shift even once every two months or so; aside from shifting my sleep schedule once every two months, it would be a week where I would probably be fairly unproductive.