by ElevenLathe
236 days ago
If "CPU > 80%" is not an error state for your application, then that is a pointless alert and it should be removed. Ideally, alerts should only be generated when ($severity_of_potential_bad_state * $probability_of_that_state) is high. In other words, for marginally bad states, you want high confidence before alerting. For states that are really mega bad, it may be OK to loosen that and alert even when you are less confident that the state is actually occurring.

IME CPU% alerts are typically totally spurious in a modern cloud application. To get the most out of your spend, you actually want your instances working close to their limits, because the intent is to scale out when your application gets busy. So instead of CPU, monitor things that are as close to the user experience or a business metric as possible. P99 request latency, 5xx rate, etc. are OK, but ideally you go even further into application-specific metrics. For example, Facebook might ask: what's the latency between uploading a cat picture and getting its first like?
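To make the severity-times-probability rule concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Signal` class, the 0 to 10 severity scale, the threshold of 2.0, and the example numbers are all made up, and estimating the probability is the hard part that this sketch simply takes as an input.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    severity: float     # hypothetical 0-10 scale: how bad the state would be
    probability: float  # estimated 0-1 confidence that the state is occurring


def should_alert(signal: Signal, threshold: float = 2.0) -> bool:
    """Alert only when severity * probability exceeds the (assumed) threshold."""
    return signal.severity * signal.probability > threshold


# Illustrative values only, not measurements from any real system.
signals = [
    Signal("cpu_above_80_percent", severity=1.0, probability=0.9),    # score 0.9
    Signal("p99_latency_over_slo", severity=4.0, probability=0.75),   # score 3.0
    Signal("data_loss_suspected", severity=10.0, probability=0.25),   # score 2.5
]

alerts = [s.name for s in signals if should_alert(s)]
print(alerts)
```

With these made-up numbers the near-certain but low-severity CPU signal stays quiet, while the severe state fires even at 25% confidence, which is the trade-off described above.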
|
That requires building a risk assessment model, plus the sensors, thresholds, and alerts around it. That is quite a lot of work, and it is very specific to each case.