by lifeisstillgood 4485 days ago
Not trolling, but how do you configure escalations properly when a queue might delay longer than any arbitrary period? In short: tell me your secrets.
2 comments

Well, why are you triggering on something that appears to have a completely random amount of delay? Either you choose your line in the sand, or you monitor the dependency that is causing the variability.
A typical Nagios alert will fire if a service hasn't been updated in X seconds. Sometimes the queue of incoming events gets backed up, and Nagios doesn't receive the results of service probes until X+5 seconds or 2X seconds later (due to internal Nagios design problems, not because the services are actually delayed).

So Nagios thinks "Service Moo hasn't contacted us in 60 seconds, ALERT!" when the update is actually in the event log but Nagios hasn't processed it yet.
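To make the failure mode above concrete, here is a minimal Python sketch of a freshness check that only sees results it has already processed. This is an illustration under stated assumptions, not how Nagios is implemented: the names `PassiveResult`, `last_seen`, and `is_stale`, and all the timestamps, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PassiveResult:
    service: str
    generated_at: float  # when the probe actually ran (seconds)
    processed_at: float  # when the monitor finally read it from its queue


def last_seen(results, now, processed_only):
    """Newest probe time known at `now`.

    With processed_only=True, results still sitting in the queue are invisible,
    which is the situation described in the comment above.
    """
    times = [
        r.generated_at
        for r in results
        if r.generated_at <= now and (not processed_only or r.processed_at <= now)
    ]
    return max(times)


def is_stale(last_update, now, threshold):
    """Freshness check: True if nothing was seen within `threshold` seconds."""
    return now - last_update > threshold


THRESHOLD = 60.0  # the "X seconds" from the comment (hypothetical value)

results = [
    PassiveResult("moo", generated_at=0.0, processed_at=1.0),
    # Probe ran at t=50 but sat in a backed-up queue until t=70.
    PassiveResult("moo", generated_at=50.0, processed_at=70.0),
]
now = 65.0  # the freshness check runs while the second result is still queued

# The monitor's view: only the t=0 result is visible, so the service looks stale.
monitor_alerts = is_stale(last_seen(results, now, processed_only=True), now, THRESHOLD)
# Ground truth: the service reported 15 seconds ago.
actually_stale = is_stale(last_seen(results, now, processed_only=False), now, THRESHOLD)
```

In this sketch `monitor_alerts` is true while `actually_stale` is false: the alert comes from processing lag, not from the service.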

I haven't seen this in ~1k services, but I guess it probably depends on the spec of the monitoring system to some degree, and I realise that 1k+ hosts is likely a different story. If you're using passive checks in any high-rate capacity, you should be using NSCA or increasing the frequency at which results are read in anyway. This is another problem Icinga handles better. I say Nagios for convenience's sake, but my comments here refer to Icinga (and to Nagios XI, which is comparable but stupidly expensive).
No matter how long the delay until alerting is, there is always the possibility of a service that stays critical for $time + 1. It's just that delaying the alert by a minute or two makes no appreciable difference in most circumstances (if it does, you should have 24/7 staff anyway), and it filters out services that briefly drop out and immediately come back, e.g. a service or host restart that happened to be caught at the wrong time.

That, and setting up proper retry intervals for checks that take a long time to execute.
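The "delay the alert a minute or two" idea can be sketched as requiring several consecutive failed checks before notifying anyone. In Nagios and Icinga this is roughly what the `max_check_attempts` and `retry_interval` directives are for; the Python below is only a hypothetical illustration of the logic (the function `should_alert` and the sample data are made up), not the monitoring system's actual code.

```python
def should_alert(check_results, max_attempts):
    """Return True once `max_attempts` consecutive checks have failed.

    `check_results` is a sequence of booleans, True meaning the check passed.
    Any passing check resets the failure counter.
    """
    consecutive_failures = 0
    for ok in check_results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= max_attempts:
            return True
    return False


# A restart caught at the wrong moment: one failed check, then recovery.
blip = [True, False, True, True]
# A real outage: the service stays down across every retry.
outage = [True, False, False, False]

blip_alerts = should_alert(blip, max_attempts=3)
outage_alerts = should_alert(outage, max_attempts=3)
# With no delay at all (alert on the first failure), the blip pages someone.
blip_alerts_without_delay = should_alert(blip, max_attempts=1)
```

With three attempts spaced a retry interval apart, the blip is filtered out and the outage still alerts; the cost is that a real outage is reported a few retry intervals later.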