by heegemcgee
2960 days ago
Overall, I would say "using email for system tasks considered harmful". I have worked several devops / sysadmin jobs where my inbox took weeks to tame with filters because of rampant abuse of automated system emails. Every alert should be actionable. And email doesn't have good reliability or timeliness: it can take hours for me to get a push notification on my phone that there is email, and in the evenings I really shouldn't be looking at email at all. So we should be using a proper alert system via SMS (PagerDuty is pretty great for this, but I also like Twilio, and Amazon SNS is just fine too).

More germane to the topic at hand, I'd recommend:

a) Setting up log monitors with Nagios, or Zabbix, or your favorite tool. You want to regex match on certain strings in the log file, like "Deadlock" or "out of memory". Pass that match on to your monitoring system and get a proper, actionable alert.

b) Aggregating the logs. For convenient access, I'd recommend Graylog (or ELK or Splunk) if you have more than a handful of nodes. This makes it easy to search through logs or review them without signing into all those nodes. You can also push them over to Amazon CloudWatch Logs for archival and rudimentary search.
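To make (a) concrete, here is a rough sketch of the regex matching in Python. It is not how Nagios or Zabbix implement their log checks; the function names `find_alert_lines` and `nagios_status` are made up for illustration, and the only assumption borrowed from Nagios is its plugin exit-code convention (0 for OK, 2 for CRITICAL).

```python
import re

# The two strings come from the examples above; extend the alternation
# with whatever your own logs use. Case-insensitive on purpose.
ALERT_PATTERN = re.compile(r"deadlock|out of memory", re.IGNORECASE)


def find_alert_lines(lines):
    """Return (line_number, text) for every line matching ALERT_PATTERN."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if ALERT_PATTERN.search(line)
    ]


def nagios_status(matches):
    """Map matches to a Nagios-style (exit_code, message) pair.

    Hypothetical helper: 2 means CRITICAL, 0 means OK.
    """
    if matches:
        number, text = matches[0]
        message = (
            f"CRITICAL - {len(matches)} matching line(s), "
            f"first at line {number}: {text}"
        )
        return 2, message
    return 0, "OK - no matching lines"
```

In practice you would feed this only the lines written since the last check (for example by remembering a file offset), print the message, and exit with the returned code so the monitoring system can raise the alert.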
How do you know the difference between "everything is working properly" and "the logging and/or monitoring has stopped working"?