by jsmeaton 4510 days ago
We had a perfect storm of problems only 2 weeks ago.

1. A vendor Tomcat application had a memory leak, consumed all the RAM on a box, and crashed with an OOM

2. The warm standby application was slightly misconfigured, and was unable to take over when the primary app crashed

3. Our Nagios was configured to email us, but something had gone wrong with ssmtp 2 days prior, and it was unable to contact Google Apps

3a. No one was paying any attention to our server metric graphs, and we didn't have a good enough way to flag "pay attention to these specific graphs because they are currently outside the norm"

A very embarrassing day for us that one.

We're now working on better graphing, and have set up a basic ssmtp check to SMS us if there is an issue. Monitoring is hard.

2 comments

You may want to check out OpsGenie heartbeat monitoring, or essentially implement the same idea yourself. Our heartbeat monitoring expects to receive messages (via email or API) from monitoring tools periodically and notifies you via push/SMS/phone if we don't receive one within 10 minutes. I think this pattern is very useful for ensuring that alert notifications are working.
> and have set up a basic ssmtp check to SMS us if there is an issue.

And what will happen when the network (or the alert server) is down?

You must put some check outside your network, on independent infrastructure. Adding another protocol on the same network is still subject to Murphy's law.
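A minimal sketch of what such an outside check could look like, assuming it runs on a machine that shares nothing with the monitored network. The function name `port_is_reachable` and its parameters are hypothetical, and how the result gets turned into an SMS or phone call is left out.

```python
import socket


def port_is_reachable(host, port, timeout_seconds=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout.

    Hypothetical helper for illustration: run it from infrastructure that is
    independent of the network being checked.
    """
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False
```

Run on a schedule against the alert server's address, a False result would be the trigger to notify someone over a channel that does not depend on that server.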

Independent infrastructure is a good idea but not always feasible for everyone. At OpsGenie, to resolve this problem, we came up with a solution we refer to as "heartbeat monitoring". This basically allows monitoring tools to send us periodic heartbeat messages that indicate that the tool is up and can reach us. If we don't receive a heartbeat message from a tool within 10 minutes, we generate an alert and notify the admins. It's not out-of-band management, but it does the trick to prevent situations like the one jsmeaton described.

http://support.opsgenie.com/customer/portal/articles/759603-...
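
For anyone who wants to implement the same idea themselves, the receiving side of a heartbeat check could be sketched roughly like this. This is not OpsGenie's implementation; the class name `HeartbeatMonitor`, its methods, and the injectable `clock` are assumptions made for illustration, and only the 10-minute window comes from the comments above.

```python
import time


class HeartbeatMonitor:
    """Remembers the last heartbeat per source and reports sources gone silent.

    Hypothetical sketch: a real service would also persist state and send
    the actual notifications.
    """

    def __init__(self, timeout_seconds=600, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds  # 600 s = the 10-minute window
        self._clock = clock                     # injectable so it can be tested
        self._last_seen = {}

    def beat(self, source):
        """Record that a heartbeat just arrived from the given source."""
        self._last_seen[source] = self._clock()

    def expired(self):
        """Return the sorted names of sources whose last heartbeat is too old."""
        now = self._clock()
        return sorted(
            source
            for source, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

The monitoring tool would call something equivalent to `beat("nagios")` every few minutes (via email or an API request), and a periodic job would alert on whatever `expired()` returns. The point is that silence itself becomes the alert, so a broken mail path like the ssmtp failure gets noticed.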