
by sqldba 2503 days ago

I don’t think it’s normal.

At a previous company I was on call every second week and would receive a call maybe once every few months. That was with many hundreds of servers.

At another company I’m on call one week per month and get called once or twice. That’s with just a few hundred servers.

In the first case all time was reimbursed as time off in lieu. In the second case my salary more than makes up for any inconvenience.

However in both cases I was very proactive in defining what is on call - critical production issues only. If it’s not critical or not production then I won’t log on to look at it.

And in both cases I had a LOT of false alarms from bad alerts when I started. I had every alert that produced false alarms disabled.

You’ll get pushback but I didn’t care - you can’t have an alarm waking people up every night on the off chance that one in a hundred will actually be an error. And hilariously, if you started including your boss on the call, they’d quickly agree it’s not acceptable. The human cost isn’t worth it.

While there’s often tonnes of room for improvement in monitoring and alerting (root cause analysis etc., as others have mentioned), in my experience most of the metrics and alarms are garbage anyway, and can and should be done away with. If it came from a boxed product, nearly all of it should be turned off from the get-go. That crap is always pointless.

Oh no, a server’s CPU usage has increased and memory is low because - it’s doing what it’s meant to? What junk.

1 comment

debunn 2503 days ago

Thanks - yeah, all of the 5-7 incidents I'm seeing are considered high priority and require action. We get lots of the noisy false system alarms too, but those don't require me to action them thankfully.
