by burnthrow 1991 days ago
> It is very unusual for anything to come up overnight and being paged is the last resort.

This does not jibe with my experience. Most companies aren't Google as you've described it, and in most cases the person on pager duty is the first human examining the incident.

> If there was a sleepless night

How about if they get woken up each night for an alarm that turns out to not be a big deal? That is the typical on-call experience, getting woken up for 15-30 minutes each night, cortisol from 0 to 100 in the 15 seconds it takes to get into Work Mode.

I guess that doesn't qualify as "sleepless" but between it and the general stress of not being able to turn the phone on silent, I'd call it "shit sleep." Nobody should be subject to it. How can you expect somebody to produce decent software in this condition?

4 comments

> That is the typical on-call experience, getting woken up for 15-30 minutes each night

There is no "typical on-call experience". Some teams have an on-call rotation that goes a year without being used. Some oncalls get paged once a week and it's a serious issue that will take an hour or two to resolve. Some oncalls are impossible to handle, with alerts every few hours.

How does your experience at some other company help anyone understand this guy's company? It makes no sense as a response.

This is like when I said I once had unlimited vacation and I took eight weeks off a year for three years and people were like "That's not my experience". Okay, well, sucks for you. No one can do anything with that.

> That is the typical on-call experience, getting woken up for 15-30 minutes each night, cortisol from 0 to 100 in the 15 seconds it takes to get into Work Mode.

The only company where I ever had that happen was the big company. At the other two companies I've worked at that had on-call, if anything like that happened, we would tweak alarm levels so it didn't happen anymore.

If you're not tweaking alarm levels or fixing code to clear out false alarms, it's not a sustainable on-call rotation and that needs to be fixed immediately.
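To make "tweaking alarm levels" concrete, here is a minimal sketch of one common adjustment: only page when a metric stays over its threshold for several consecutive samples, so a brief spike never wakes anyone. The class name, the 5% threshold, and the sample values are all made up for illustration; real alerting systems express this differently.

```python
from dataclasses import dataclass


@dataclass
class SustainedAlert:
    """Hypothetical rule: page only after `required` consecutive bad samples."""
    threshold: float
    required: int
    streak: int = 0

    def observe(self, value: float) -> bool:
        # Any healthy sample resets the streak, so short spikes never page.
        self.streak = self.streak + 1 if value > self.threshold else 0
        return self.streak >= self.required


# Assumed error-rate samples, one per check interval.
alert = SustainedAlert(threshold=0.05, required=3)
samples = [0.01, 0.09, 0.02, 0.07, 0.08, 0.10, 0.02]
decisions = [alert.observe(s) for s in samples]
print(decisions)  # only the sixth sample pages
```

The lone spike at the second sample is ignored; the page fires only once three bad samples arrive in a row.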

I've been the solitary on-call for the main service of a company before and I almost never got called because 1) we had good KB articles for the operations center for when things did break, and 2) things very rarely broke in a way that wasn't automatically fixable.

It's amazing in how many cases "on service crash, automatically remove the broken machine from the pool, restart the service, and bring the machine back" is a valid fix for the weird edge-case junk that would otherwise be a call.
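Roughly, that kind of automatic fix could look like the sketch below. Everything here is hypothetical: `remediate`, the `restart` and `is_healthy` callbacks, and the host names stand in for whatever your load balancer and service manager actually expose.

```python
from typing import Callable, List


def remediate(host: str,
              pool: List[str],
              restart: Callable[[str], None],
              is_healthy: Callable[[str], bool],
              max_attempts: int = 3) -> bool:
    """Drain a crashed host, restart its service, re-add it only if healthy.

    Returns True when fixed automatically, False when a human is needed.
    """
    if host in pool:
        pool.remove(host)  # stop routing traffic to the broken machine
    for _ in range(max_attempts):
        restart(host)
        if is_healthy(host):
            pool.append(host)  # back in service, nobody gets paged
            return True
    return False  # still broken: keep it out of the pool and escalate


# Fake environment: the service comes back healthy after the second restart.
pool = ["web-1", "web-2", "web-3"]
restarts: List[str] = []
fixed = remediate("web-2", pool,
                  restart=restarts.append,
                  is_healthy=lambda h: len(restarts) >= 2)
print(fixed, pool, restarts)
```

The point of the `max_attempts` cap is that the page still happens when the restart loop does not help, so automation only absorbs the boring cases.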

I've experienced this at small startups and BigCorps. Granted, at the BigCorps fewer things blew up in general, and when they did, it was interesting.

> How about if they get woken up each night for an alarm that turns out to not be a big deal?

Write a post-mortem on this "crying wolf" pattern. It is definitely a bug in your alerting rules, so action has to be taken; otherwise people will routinely ignore important alerts.
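As a starting point for such a post-mortem, a small script can show which rules cry wolf most often. This is only a sketch: the page log, the rule names, and the 50% cutoff are invented for illustration, and you would pull the real data from your paging tool.

```python
from collections import Counter

# Hypothetical page log: (alert rule name, whether the responder had to act).
page_log = [
    ("disk_usage_high", False),
    ("disk_usage_high", False),
    ("api_5xx_rate", True),
    ("disk_usage_high", False),
    ("queue_depth", False),
    ("api_5xx_rate", True),
]


def noisy_rules(log, max_false_rate=0.5):
    """Return rules whose share of non-actionable pages exceeds the cutoff."""
    total = Counter(rule for rule, _ in log)
    false_alarms = Counter(rule for rule, acted in log if not acted)
    return sorted(r for r in total if false_alarms[r] / total[r] > max_false_rate)


noisy = noisy_rules(page_log)
print(noisy)  # rules to fix or retune first
```

With the sample log above, `disk_usage_high` and `queue_depth` are flagged because none of their pages required action.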