by johnsbrayton 1085 days ago

I wish I did. My approach is that I have a Ruby script that runs every five minutes and does a bunch of tests. The script takes a couple of minutes to execute. It connects to servers via SSH to check things out, runs end-to-end tests, and then writes its results to a JSON file.
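For anyone who wants to try something similar, here is a minimal sketch of what such a script could look like. It is not the actual script: the `Check` struct, the `"page"`/`"email"` severity labels, and the `written_at`/`results` JSON layout are all assumptions made up for illustration, and the real checks (SSH, end-to-end tests) are left as placeholder lambdas.

```ruby
require "json"
require "time"

# Hypothetical check definition: a name, a severity ("page" or "email"),
# and a probe lambda that returns a truthy value when things are healthy.
Check = Struct.new(:name, :severity, :probe)

# Run every check and record whether it passed.
def run_checks(checks)
  checks.map do |check|
    ok = begin
      check.probe.call ? true : false
    rescue StandardError
      false # a check that raises counts as a failure
    end
    { "name" => check.name, "severity" => check.severity, "ok" => ok }
  end
end

# Write to a temporary file and rename it into place, so a reader
# never sees a half-written JSON file.
def write_status(path, results, now: Time.now)
  payload = { "written_at" => now.utc.iso8601, "results" => results }
  tmp = "#{path}.tmp"
  File.write(tmp, JSON.generate(payload))
  File.rename(tmp, path)
  payload
end
```

A cron entry running every five minutes would call `write_status` with the output of `run_checks`. The timestamp in the file is one way to let a reader tell how old the results are.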

It runs on a Linode instance with a webapp whose sole responsibility is to respond to Pingdom requests. Pingdom checks two URLs: one returns a 500 if the JSON file indicates an issue that warrants texting me; the other returns a 500 if the JSON file indicates a lower-priority issue that warrants emailing me. Pingdom is configured accordingly.

If for any reason the JSON file has not been written in the past 10 minutes (?) or cannot be read and parsed, both URLs return a 500.
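The decision each URL makes can be sketched roughly like this. The comment doesn't say what the webapp is written in, so Ruby is used here only to match the script. The JSON layout (a `written_at` ISO 8601 timestamp plus a `results` array with `severity` and `ok` fields), the `"page"`/`"email"` labels, and the `status_code_for` helper are assumptions for illustration, not the real implementation.

```ruby
require "json"
require "time"

MAX_AGE_SECONDS = 10 * 60 # the "10 minutes (?)" window; the exact value is a guess

# Returns the HTTP status for one endpoint. `severity` is "page" for the
# texting URL or "email" for the lower-priority URL (hypothetical labels).
def status_code_for(path, severity, now: Time.now)
  data = JSON.parse(File.read(path))
  written_at = Time.iso8601(data.fetch("written_at"))
  return 500 if now - written_at > MAX_AGE_SECONDS # stale file: script may be dead

  failing = data.fetch("results").any? do |result|
    result["severity"] == severity && !result["ok"]
  end
  failing ? 500 : 200
rescue StandardError
  500 # missing, unreadable, or unparseable file: fail closed on both URLs
end
```

The point of failing closed is that a crashed or hung monitoring script trips both alerts on its own, so nothing else has to watch the watcher.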

The script has a log file, so when I get an alert I can check the log file to determine what is wrong.

This is likely atypical, but it works really well for me. My scripts do the work of monitoring the heck out of everything. I only need Pingdom (or a service like it) to monitor two URLs and do the texting/emailing.

But my overall approach is to think of monitoring like unit tests or integration tests: when I think of something that could go wrong, I try to make sure there is monitoring that can detect it and alert me, when possible before it becomes urgent. And when something does go wrong that was not automatically detected, it's a high priority to add monitoring around that.