by riffraff 5643 days ago
The problem is that you will probably think of the things you wish you had monitored only _after_ the crisis has happened. People are built in a way that makes it hard for us to anticipate what may go wrong.

So another simple rule I have learned over time is to trust and understand the defaults, plugins, knobs, and metrics that come with well-known monitoring systems, even when your first reaction is "why the hell should I monitor _that_?". This way you use other people's experience as a backup for your own.

2 comments

How about starting with the application/business metrics first, as those are presumably easier to articulate, and then, as things fail over time, moving down the stack (infra/system) to get earlier warnings?
What I have learned: take Munin (or your solution of choice) and install all the plugins for the infrastructure you use. It's hard to monitor too much, but easy to monitor too little.