by toast0
236 days ago
Monitoring is one of those things where you report on what's easy to measure, because measuring the "real metric" is very difficult. If you can take a reasonable amount of time and come up with something better for your system, great; do it. I've worked with a lot of systems where noisy alerts and human filtering were the best we could reasonably do, and it was fine. In a system like that, not every alert demands an immediate response. A single high CPU page doesn't demand a quick response, and the appropriate response could be "CPU was high for a short time, I don't need to look at anything else." Of course, you could be missing an important signal that should have been investigated, too. OTOH, if you get high CPU alerts from many hosts at once, something is up, but it could just be an external event that causes high usage, and hopefully your system survives those autonomously anyway. Ideally, monitoring, operations, and development feed into each other, so that the system evolves to work best with the human needs it has.
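
To make the "one host vs. many hosts at once" distinction concrete, here is a rough sketch of the kind of filtering a human does in their head. Everything in it is an assumption for illustration: the function name `classify_cpu_alerts`, the 5-minute window, and the 3-host threshold are made up, not taken from any particular monitoring tool.

```python
from datetime import datetime, timedelta


def classify_cpu_alerts(alerts, window=timedelta(minutes=5), fleet_threshold=3):
    """Classify high-CPU alerts as 'none', 'isolated', or 'fleet-wide'.

    alerts: iterable of (host, timestamp) pairs (hypothetical input shape).
    window and fleet_threshold are illustrative defaults, not recommendations.
    """
    alerts = list(alerts)
    if not alerts:
        return "none"

    # Only consider alerts close in time to the most recent one.
    latest = max(ts for _, ts in alerts)
    recent_hosts = {host for host, ts in alerts if latest - ts <= window}

    # Many distinct hosts alerting together suggests something systemic
    # (or an external event); a lone host is often safe to just note.
    if len(recent_hosts) >= fleet_threshold:
        return "fleet-wide"
    return "isolated"
```

An "isolated" result can still hide a real problem, and a "fleet-wide" one can still be a harmless traffic spike, so this only helps decide where to look first.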