Hacker News
by ravedave5 685 days ago
The goal for oncall should be to NEVER get called. If someone gets called when they are oncall their #1 task the next day is to make sure that call never happens again. That means either fixing a false alarm or tracking down the root cause of the call. Eventually you get to a state where being called is by far the exception instead of the norm.
3 comments

This is how my team used to work when I was on call in telecoms a decade ago. In the right engineering culture, and with management buy-in, it works really well.

We deployed a new system and had one week on call for each of five team members. The first couple of rotations were hell. Almost every night ended up with at least one wake-up call. As we learned how to solve each type of outage, we taught the first-line staff how to reboot the right components so we didn’t get as many wake-ups, while we spent our days fixing the bugs. And eventually the system stopped crashing.

The on-call pay was really good (nearly double for that week) and it was a pretty sweet reward to be able to rake that in as calls stopped coming. We broke out a bottle of champagne when the first week of no calls had passed.

Eventually on-call was cancelled.

Imagine how this story would have ended if management had incentivized us differently, for example if you only got the extra pay for the nights where you got pages.

I wish everyone shared your philosophy! I once worked at a company where it was expected to get 10+ pages per day, and worse, a configuration error by a customer success team would trigger an engineering page because the error handling didn't distinguish between a config problem and an actual system issue. It was insane.
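
To make the missing distinction concrete, here is a minimal sketch of error handling that separates a config problem from an actual system issue. Every name in it (`ConfigError`, `SystemFault`, `Route`, `route_alert`) is hypothetical and made up for illustration; it is not how that company's system worked.

```python
from enum import Enum


class Route(Enum):
    """Where an alert ends up (hypothetical destinations)."""
    PAGE_ENGINEERING = "page_engineering"
    NOTIFY_CONFIG_OWNER = "notify_config_owner"


class ConfigError(Exception):
    """Hypothetical: customer-facing configuration is invalid."""


class SystemFault(Exception):
    """Hypothetical: the service itself is failing."""


def route_alert(error: Exception) -> Route:
    """Decide who hears about an error.

    Config mistakes go to whoever owns the config (for example the
    customer success team) instead of waking an engineer. Anything else,
    including error types nobody anticipated, still pages engineering,
    so an unclassified failure is never silently dropped.
    """
    if isinstance(error, ConfigError):
        return Route.NOTIFY_CONFIG_OWNER
    return Route.PAGE_ENGINEERING
```

Defaulting unknown errors to a page is a deliberate choice here: the aim is fewer pointless pages, not fewer pages at any cost.
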
Depending on the stakes this is a pretty dangerous attitude. The goal for oncall is to keep the website working, and if you're tuning for "never get paged" then you'll necessarily miss an incident eventually.

If you make your goal as high availability as possible, and you only get paged on outages, then your goal should be to never get paged.

You should be building resilient architectures, not being on firewatch duty.

This is a classic developer vs business incentives misalignment.

Developers don't want to ever be paged because they don't want to be bothered, but the business might be perfectly happy to pay you to be on firewatch duty.

Consider a "low traffic" alert: how can you tell the difference between a slow period at 3am on a holiday and a true outage? You can't without someone getting up and testing whether the site is still up. (Maybe you can automate that check, but there are always edge cases you can't automate.)

OP seemed to suggest it's better to disable the alarm than to just suffer the false alarm every now and then. I doubt very much that the people paying you for the on-call service would agree though.

> Developers don't want to ever be paged because they don't want to be bothered

This is a very reductive statement.

Developers have experienced their best colleagues burning out and leaving jobs because of on-call being completely overwhelming.

Developers want to behave intelligently.

Developers want the system to work.

Developers don’t want to burn their lifespan for false alarms that are being sent because someone didn’t spend 30 seconds thinking about whether a human being needs to be woken up in the middle of the night for whatever widget they’re slapping together.

> The goal for oncall should be to NEVER get called.

Is that not also reductive then? Or maybe my statement pretty accurately captures that sentiment without 4 sentences of explanation.

But no, instead of engaging with the meat of my argument you just reductively attack one sentence.

I get it, I'm oncall right now for my job. I don't like it when alarms go off. I also understand that if I were to tune the alarms so I "NEVER get called" I'd be out of a job soon enough because the business would go under.

Okay, dialing up good-faith engagement.

How would your interpretation change if the article said this instead?

> The goal for oncall should be to continuously tune the system toward having no outages and no false alarms.

FWIW, I did only attack one sentence. This was not exactly intended to be dismissive. It was my reaction to, in my eyes, the weakest part of your argument.

This is classic misalignment of business needs vs. perceived management wants.

Are they paying you to answer false alarms, or are they paying you to make sure the site is available and performant to keep customers happy? Nobody with half a brain wants to answer a bunch of false alarms. Are there people who will happily get paid to ACK yet another noisy alarm just to collect a paycheck? Certainly, but these are button pushers, not problem solvers.

Your low traffic alert scenario simply requires synthetic requests. This is how you test anything that has low usage but requires high reliability.
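
A rough sketch of what that could look like, assuming a plain HTTP health check is a fair stand-in for "the site is up". The function names, the threshold, and the retry count are all hypothetical; a real setup would more likely use the monitoring system's own probe feature.

```python
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Send one synthetic GET request; True only if the site answers 2xx."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def should_page(
    requests_per_min: float,
    low_traffic_threshold: float,
    probe: Callable[[], bool],
    attempts: int = 3,
) -> bool:
    """Page a human only when traffic is low AND synthetic requests fail.

    Low traffic alone is ambiguous (3am on a holiday looks like an outage),
    so the probe is what breaks the tie.
    """
    if requests_per_min >= low_traffic_threshold:
        return False  # normal traffic, nothing to investigate
    # Retry a few times so a single dropped request does not wake anyone.
    return not any(probe() for _ in range(attempts))
```

For example, `should_page(2, 50, lambda: http_probe("https://example.com/health"))` stays quiet on a slow night as long as the (placeholder) health URL responds. It does not cover every edge case, such as the probe's own network path being down, so it narrows the false alarms rather than eliminating them.
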