by al_borland
242 days ago
I spent the first half of my career in ops: watching those alerts, escalating things, fixing stuff, writing EDA to fix stuff, and working with monitoring and dev teams to tune monitoring. Over time I worked my way into a dev role, but I'm still focused on infrastructure.

The problem you're starting to run into is that you see the monitors as useless, which will ultimately lead to ignoring them, so when there is a real problem you won't know about it. What you should be doing is tuning the monitors to make them useful. If your app sees occasional spikes that last 10 minutes and the monitor checks every 5 minutes, set it to create an alert only after 3 consecutive failures. That gives you some tolerance for spikes, but it will still alert you to a prolonged issue, which needs to be addressed because of the performance problems it will inevitably cause.

If other alerts fire often and need the same repeatable action each time, that's where EDA (Event Driven Automation) comes in. Write some code to fix what needs to be fixed, and have it run automatically when the alert comes in. You then only need to step in when the EDA code can't fix the issue. Fix it once in code instead of every time you get an alert.
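
To make the two ideas above concrete, here is a rough Python sketch, not tied to any particular monitoring product. The names `ConsecutiveFailureMonitor`, `Remediator`, and the alert names are hypothetical, made up for illustration. It assumes a check that reports pass/fail on each poll and remediation handlers that return True when the fix worked.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ConsecutiveFailureMonitor:
    """Hypothetical helper: alert only after `threshold` failed checks in a row."""
    threshold: int = 3
    failures: int = 0

    def record(self, check_passed: bool) -> bool:
        """Record one poll result; return True if an alert should fire now."""
        if check_passed:
            # Any passing check resets the streak, so short spikes are tolerated.
            self.failures = 0
            return False
        self.failures += 1
        # Fire once, at the moment the streak first reaches the threshold.
        return self.failures == self.threshold


@dataclass
class Remediator:
    """Hypothetical EDA dispatcher: run a registered fix, escalate if it fails."""
    handlers: Dict[str, Callable[[], bool]] = field(default_factory=dict)
    escalated: List[str] = field(default_factory=list)

    def register(self, alert_name: str, handler: Callable[[], bool]) -> None:
        self.handlers[alert_name] = handler

    def handle(self, alert_name: str) -> str:
        handler = self.handlers.get(alert_name)
        if handler is not None:
            try:
                if handler():
                    return "auto-fixed"
            except Exception:
                # A crashing fix is treated the same as a fix that did not work.
                pass
        # No handler, or the handler could not fix it: a human takes over.
        self.escalated.append(alert_name)
        return "escalated"
```

With a 5 minute poll interval and `threshold=3`, a spike that only fails two polls in a row never alerts, while a problem that persists across three polls does. On the EDA side, a human only sees the alerts that land in `escalated`.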