by time0ut 2258 days ago
My general approach is to create monitors (in something like Splunk or ELK) that watch logs and fire alerts (email, SMS, PagerDuty, etc.) when their conditions are met.

I create monitors for health issues, like out-of-memory errors or pod failures. I create monitors that compute the error rate and trend for each endpoint and alert if the rate crosses a threshold. Similarly, I'll create monitors for dead letter queues, email send failures, or anything else that might go wrong in an app.
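
As a rough illustration of the per-endpoint error rate check, here is a minimal Python sketch. It is not Splunk or ELK query syntax. The record fields (`endpoint`, `status`), the 5% default threshold, the choice to count only 5xx responses as errors, and the use of "current window minus previous window" as the trend are all assumptions made for the example.

```python
from collections import defaultdict


def error_rates(records):
    """Return {endpoint: fraction of requests whose status is 5xx}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rec in records:
        totals[rec["endpoint"]] += 1
        if rec["status"] >= 500:
            errors[rec["endpoint"]] += 1
    return {ep: errors[ep] / totals[ep] for ep in totals}


def check_endpoints(previous, current, threshold=0.05):
    """Compare two time windows of log records.

    Returns a list of (endpoint, rate, trend) for every endpoint whose
    error rate in the current window exceeds the threshold. The trend is
    the change in rate since the previous window (an assumed definition).
    """
    prev_rates = error_rates(previous)
    alerts = []
    for endpoint, rate in sorted(error_rates(current).items()):
        if rate > threshold:
            trend = rate - prev_rates.get(endpoint, 0.0)
            alerts.append((endpoint, rate, trend))
    return alerts
```

In a real setup the alert list would be handed to whatever notifies people (email, SMS, PagerDuty) instead of being returned.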

This may sound like a lot of monitors, but I try to log things in common ways, so a handful of monitors can watch hundreds of endpoints or queues.
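
To make the "log things in common ways" point concrete, here is a small Python sketch under assumed names. The fields `kind`, `name`, and `outcome` are hypothetical, as are the helpers `log_event` and `failures_by_resource`. The idea is only that when every endpoint, queue, or mailer emits the same shape of event, a single rule can cover all of them.

```python
import json


def log_event(kind, name, outcome, **extra):
    """Serialize an event with the same top-level fields for every resource."""
    event = {"kind": kind, "name": name, "outcome": outcome, **extra}
    return json.dumps(event, sort_keys=True)


def failures_by_resource(lines):
    """One generic rule: count failures per (kind, name) across all logs."""
    counts = {}
    for line in lines:
        event = json.loads(line)
        if event["outcome"] == "failure":
            key = (event["kind"], event["name"])
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Adding a new queue or endpoint then needs no new monitor, as long as it logs through the same helper.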

Finally, for complicated, mission-critical systems, I build in support for synthetic transactions that avoid undesired side effects. These may generate extra trace logs in the app. The requests are submitted on a regular schedule, and their inputs and outputs are logged. Then I build more monitors on these logs.
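
A synthetic transaction along these lines might look roughly like the following Python sketch. Everything here is hypothetical: the `handle_order` handler, the `synthetic` flag, the email side effect, and the log field names are stand-ins for illustration, and the scheduling itself (cron, a job runner, etc.) is left out.

```python
import json


def handle_order(order, send_email, log, synthetic=False):
    """Process an order; skip the side effect when the request is synthetic."""
    total = sum(item["price"] * item["qty"] for item in order["items"])
    if synthetic:
        # Extra trace log instead of the real side effect.
        log(json.dumps({"event": "trace", "synthetic": True, "step": "email_skipped"}))
    else:
        send_email(order["customer"], total)
    return {"status": "ok", "total": total}


def run_synthetic_check(log):
    """Submit one known request and log its input, output, and verdict."""
    order = {
        "customer": "synthetic@example.invalid",
        "items": [{"price": 10, "qty": 2}, {"price": 5, "qty": 1}],
    }
    expected = {"status": "ok", "total": 25}
    sent = []  # stays empty if the side effect was correctly skipped
    result = handle_order(order, lambda *args: sent.append(args), log, synthetic=True)
    passed = result == expected and not sent
    log(json.dumps({"event": "synthetic_result", "input": order,
                    "output": result, "passed": passed}))
    return result
```

A monitor can then alert when a `synthetic_result` event has `passed` set to false, or when no such event shows up within the expected interval.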