by heegemcgee 2960 days ago
Overall, I would say "using email for system tasks considered harmful". I have worked several devops / sysadmin jobs where my inbox took weeks to tame with filters because of rampant abuse of automated system emails.

Every alert should be actionable. And email isn't reliable or timely - it can take hours for me to get a push notification on my phone that there is email, and in the evenings, I really shouldn't be looking at email at all. So we should be using a proper alert system via SMS (PagerDuty is pretty great for this, but I also like Twilio, and Amazon SNS is just fine too).

More germane to the topic at hand: I'd recommend a) setting up log monitors with Nagios, Zabbix, or your favorite tool. You want to regex match on certain strings in the log file, like "Deadlock" or "out of memory". Pass that match on to your monitoring system and get a proper, actionable alert.
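To make the regex-matching idea concrete, here is a minimal sketch in Python. It is not tied to any particular tool: the pattern names and the `scan_log_lines` / `nagios_exit_code` helpers are made up for illustration, and the only real convention it leans on is the Nagios plugin exit code (0 = OK, 2 = CRITICAL).

```python
import re

# Illustrative patterns only; the two strings come from the comment above.
ALERT_PATTERNS = {
    "deadlock": re.compile(r"deadlock", re.IGNORECASE),
    "out_of_memory": re.compile(r"out of memory", re.IGNORECASE),
}


def scan_log_lines(lines):
    """Return (line_number, pattern_name, line) for every matching line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in ALERT_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, name, line.rstrip("\n")))
    return hits


def nagios_exit_code(hits):
    """Map results to the Nagios plugin convention: 0 = OK, 2 = CRITICAL."""
    return 2 if hits else 0
```

A wrapper script would read the log file (or only the part added since the last run), print the hits, and exit with `nagios_exit_code(hits)` so the monitoring system raises the alert.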

And b), aggregating the logs. As for convenient access, I'd recommend Graylog (or ELK or Splunk) if you have more than a handful of nodes. This makes it easy to search through logs or review them without signing into all those nodes. You can also push them over to Amazon CloudWatch Logs for archival and rudimentary search.

2 comments

> Every alert should be actionable.

How do you know the difference between "everything is working properly" and "the logging and/or monitoring has stopped working"?

Who will watch the watchmen, right? :D It's a real concern. Personally, I have a monitoring agent running, and then I have the config management agent (Puppet) validate that the monitoring agent is running.

And what Dewey said is absolutely right - you can monitor the code / service itself through health checks. In the case of a reports service, perhaps your monitor asks the API for a very small report. Or you could implement a special endpoint / controller that calls on the core code. I recently implemented a monitor that emulates a typical user session: logging in, performing popular tasks, and logging out. If any step in that process has an error, I get an alert with the step listed, and I instantly have some idea of where things are jammed. In this manner, I don't need to have pre-defined log monitors for specific errors; I can catch novel error types by virtue of exercising the code and watching for the expected responses - a 200 status in the response, ability to perform tasks that are only available after login, certain strings in the response body, etc.
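A rough sketch of that kind of session check, in Python. This is not the actual monitor described above, just one way to structure it: `Step`, `Response`, and `run_session` are hypothetical names, and `fetch` stands in for whatever HTTP client (with its cookies / login state) you would really use.

```python
from typing import Callable, NamedTuple, Optional, Sequence


class Step(NamedTuple):
    name: str           # label reported in the alert
    path: str           # hypothetical URL path to request
    expected_text: str  # string that must appear in the response body


class Response(NamedTuple):
    status: int
    body: str


def run_session(steps: Sequence[Step],
                fetch: Callable[[str], Response]) -> Optional[str]:
    """Run the steps in order; return a description of the first failure,
    or None if every step passed."""
    for step in steps:
        try:
            response = fetch(step.path)
        except Exception as exc:  # network errors, timeouts, etc.
            return f"{step.name}: request failed ({exc})"
        if response.status != 200:
            return f"{step.name}: expected status 200, got {response.status}"
        if step.expected_text not in response.body:
            return f"{step.name}: missing expected text {step.expected_text!r}"
    return None
```

Because the return value names the failing step, the alert text can say "report: expected status 200, got 500" rather than just "site down".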

You instrument your code instead of just logging. To see that your metric export is working, you can regularly export a simple value and check for its existence.
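For example, the "check for its existence" part could be as small as the Python sketch below. It assumes the exported value is a timestamp; `heartbeat_is_fresh` and its parameters are hypothetical names, and where `last_seen` comes from depends on your metrics store.

```python
import time


def heartbeat_is_fresh(last_seen, max_age_seconds, now=None):
    """Return True if the heartbeat value was exported recently enough.

    last_seen is the Unix timestamp of the most recent heartbeat, or None
    if the metric does not exist at all.
    """
    if last_seen is None:
        return False  # metric missing: the export pipeline is broken
    if now is None:
        now = time.time()
    return (now - last_seen) <= max_age_seconds
```

If this returns False, the problem is in the monitoring path itself, so "no alerts" can no longer be confused with "everything is fine".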
I just needed to get the job done, and email was the easiest way to do it. Also, I get emails pretty quickly; there isn't even a 5-minute delay.

Thanks for the concern :)