by toast0 236 days ago
Monitoring is one of those things where you report on what's easy to measure, because measuring the "real metric" is very difficult.

If you can take a reasonable amount of time and come up with something better for your system, great; do it. I've worked with a lot of systems where noisy alerts and human filtering were the best we could reasonably do, and it was fine. In a system like that, not every alert demands an immediate response. A single high CPU page doesn't demand a quick response, and the appropriate response could be 'CPU was high for a short time, I don't need to look at anything else'. Of course, you could be missing an important signal that should have been investigated, too. OTOH, if you get high CPU alerts from many hosts at once, something is up, but it could just be an external event that causes high usage, and hopefully your system survives those autonomously anyway.
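
To make the single-host versus many-hosts distinction concrete, here is a rough Python sketch of that kind of filtering. Everything in it is an assumption for illustration: the `Alert` record, the `triage` function, the 5-minute window and the 3-host threshold are all hypothetical, not something from a real monitoring tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    host: str
    metric: str
    timestamp: float  # seconds since epoch (hypothetical record shape)


def triage(alerts, window_seconds=300.0, host_threshold=3):
    """Return {metric: "investigate" | "note"}.

    A metric is marked "investigate" when at least `host_threshold`
    distinct hosts alerted on it within one `window_seconds` span;
    otherwise it is only noted. Both defaults are illustrative guesses.
    """
    by_metric = defaultdict(list)
    for alert in alerts:
        by_metric[alert.metric].append(alert)

    verdicts = {}
    for metric, group in by_metric.items():
        group.sort(key=lambda a: a.timestamp)
        widest = 0
        # Treat each alert as the start of a window and count the
        # distinct hosts that alerted before the window closes.
        for i, first in enumerate(group):
            hosts = {
                a.host
                for a in group[i:]
                if a.timestamp - first.timestamp <= window_seconds
            }
            widest = max(widest, len(hosts))
        verdicts[metric] = "investigate" if widest >= host_threshold else "note"
    return verdicts
```

A lone CPU alert comes back as "note", while three hosts alerting inside the same window comes back as "investigate". That still only tells you something is up, not whether it's an external event the system can ride out on its own; a human makes that call.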

Ideally, monitoring, operations and development feed into each other, so that the system evolves to work best with the human needs it has.

1 comments

So ideally, a system that can learn from your infrastructure and traffic patterns or metrics over time? Because that's what I'm thinking about, and your last statement seems to validate it. Also, from what I'm gathering, no tool actually exists for this.
I would not want to use that for alerts automatically, but I'd consider it for suggesting new alerts to set up or flagging potential problems, if it was at all accurate and useful.
Okay, thanks a lot, I didn't see it like this.
You are the tool. The human element.