Hacker News
by blueskin_ 4485 days ago
No matter how long the delay before alerting is, there is always the possibility of a service that stays critical for $time + 1. It's just that delaying the alert by a minute or two makes no appreciable difference in most circumstances (if it does, you should have 24/7 staff anyway), and it filters out services that briefly drop out and immediately come back, e.g. a service or host restart that happened to be caught at the wrong time.

That, and setting up proper retry intervals for checks that take a long time to execute.
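
For illustration, a rough Python sketch of the "only alert once it has stayed critical for a while" idea. This is not any particular monitoring tool's API; the `DelayedAlert` class, its `observe` method, and the 120-second default are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DelayedAlert:
    """Alert only once a check has been critical for at least `delay` seconds."""

    delay: float = 120.0  # assumed threshold, i.e. "a minute or two"
    critical_since: Optional[float] = None
    alerted: bool = False

    def observe(self, is_critical: bool, now: float) -> bool:
        """Record one check result; return True when an alert should fire."""
        if not is_critical:
            # Recovery resets the timer, so a brief blip never alerts.
            self.critical_since = None
            self.alerted = False
            return False
        if self.critical_since is None:
            # First critical result of this outage: start the timer.
            self.critical_since = now
        if not self.alerted and now - self.critical_since >= self.delay:
            # Critical for the whole delay window: alert once.
            self.alerted = True
            return True
        return False
```

Feeding it a restart that is critical at t=0 and back by t=30 produces no alert, while a service that goes critical at t=60 and is still critical at t=180 alerts exactly once. The trade-off is the one described above: a real outage is reported up to `delay` seconds later.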