by time0ut
2258 days ago
My general approach is to create monitors (in something like Splunk or ELK) that watch logs and fire alerts (email, SMS, PagerDuty, etc.) when their conditions are met. I create monitors for health issues, like out-of-memory errors or pod failures. I create monitors that compute the error rate and trend for each endpoint and alert if either crosses a threshold. Similarly, I'll create monitors for dead-letter queues, email send failures, or anything else that might go wrong in an app. This may sound like a lot of monitors, but I try to log things in common ways, so a handful of monitors can watch hundreds of endpoints or queues. Finally, for complicated mission-critical systems, I build in support for synthetic transactions that avoid undesired side effects. These may generate extra trace logs in the app. Such requests are submitted on a regular schedule, and their inputs and outputs are logged. Then I build more monitors on these logs.
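
To make the "one monitor, many endpoints" idea concrete, here is a rough sketch of a per-endpoint error-rate check. In practice this would be a saved search or alert rule in Splunk or ELK; the Python is only a stand-in. It assumes every service logs one JSON object per request with `endpoint` and `status` fields. Those field names, the 5% threshold, and the minimum request count are all assumptions for illustration, not something from a particular setup.

```python
import json
from collections import defaultdict


def find_breaches(log_lines, threshold=0.05, min_requests=20):
    """Return [(endpoint, error_rate)] for endpoints whose 5xx rate exceeds threshold.

    Assumes one JSON object per line with hypothetical "endpoint" and
    "status" fields. Lines that do not match that shape are skipped.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)

    for line in log_lines:
        try:
            event = json.loads(line)
            endpoint = str(event["endpoint"])
            status = int(event["status"])
        except (ValueError, KeyError, TypeError):
            continue  # not in the shared log format, ignore it

        totals[endpoint] += 1
        if status >= 500:
            errors[endpoint] += 1

    breaches = []
    for endpoint, total in totals.items():
        if total < min_requests:
            continue  # too little traffic for the rate to mean much
        rate = errors[endpoint] / total
        if rate > threshold:
            breaches.append((endpoint, rate))
    return sorted(breaches)


if __name__ == "__main__":
    # Hypothetical sample window: /checkout fails 2 of 20 requests, /search never fails.
    sample = [json.dumps({"endpoint": "/checkout", "status": 500})] * 2
    sample += [json.dumps({"endpoint": "/checkout", "status": 200})] * 18
    sample += [json.dumps({"endpoint": "/search", "status": 200})] * 20
    for endpoint, rate in find_breaches(sample):
        print(f"ALERT {endpoint}: {rate:.0%} of requests failed")
```

Because the check keys on the shared fields rather than on any one endpoint, a new endpoint is covered as soon as it logs in the same shape. The same pattern would apply to synthetic transactions if they log under their own endpoint or a marker field.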