We had a perfect storm of problems only 2 weeks ago.

1. A vendor Tomcat application had a memory leak, consumed all the RAM on a box, and crashed with an OOM.
2. The warm standby application was slightly misconfigured and was unable to take over when the primary app crashed.
3. Our Nagios was configured to email us, but something had gone wrong with ssmtp 2 days prior, and it was unable to contact Google Apps.
3a. No one was paying any attention to our server metric graphs, and we didn't have a good enough "pay attention to these specific graphs because they are currently outside the norm" signal.

A very embarrassing day for us, that one. We're now working on better graphing, and have set up a basic ssmtp check to SMS us if there is an issue.

Monitoring is hard.
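
For the curious, here is a rough sketch of what a "check the mail path, alert out-of-band if it is broken" probe could look like in Python. This is not the actual check described above, only an illustration. The relay host, the port (587), and the `send_sms` callback are all assumptions; `send_sms` stands in for whatever SMS gateway you actually have.

```python
import smtplib


def smtp_relay_ok(host, port=587, timeout=10.0, smtp_factory=smtplib.SMTP):
    """Return True if the relay accepts a connection and answers NOOP with 250.

    smtp_factory is injectable so the check can be exercised without a network.
    """
    try:
        with smtp_factory(host, port, timeout=timeout) as conn:
            code, _ = conn.noop()
            return code == 250
    except (smtplib.SMTPException, OSError):
        # Connection refused, DNS failure, timeout, or a protocol error.
        return False


def check_and_alert(host, send_sms, **kwargs):
    """Probe the relay; if it is down, alert over a channel that is NOT email.

    send_sms is a hypothetical callable taking one message string.
    """
    ok = smtp_relay_ok(host, **kwargs)
    if not ok:
        send_sms(f"Mail relay {host} unreachable; email alerts are not being delivered")
    return ok
```

The point of the design is that the alert for "email is broken" must not itself depend on email. Run something like this from cron or as a Nagios check, with the SMS path kept completely separate from the mail path.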