This is a really good introduction to the different levels of infrastructure monitoring and their various pros/cons.
http://librato.com (full disclosure: I hack there) is a startup with a new kind of entry in the process-level monitoring/management space. Would love any feedback the infrastructure-minded part of the community here might have.
The problem is that you will probably only think of the things you wish you had monitored _after_ the crisis has happened. People are built in a way that makes it hard for us to anticipate what may go wrong.
So another simple rule I learned with time is to trust and understand the defaults, plugins, knobs, and metrics that come with well-known monitoring systems, even when your first reaction is "why the hell should I monitor _that_?". This way you use the experience of other people as a backup for your own.
How about starting with the application/business metrics first (as those are presumably easier to articulate)? Then, as things fail over time, move down the stack (infra/system) to get earlier warnings.
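To make the "application metrics first" idea concrete, here is a rough sketch of what tracking one such metric might look like: a failure rate over a sliding time window, checked against a threshold. Everything here is made up for illustration (the `RollingRate` class, the `should_alert` helper, the 60-second window, the 5% threshold); it is not how any particular monitoring tool does it.

```python
from collections import deque
import time


class RollingRate:
    """Fraction of failed events over a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the window can be tested without sleeping
        self.events = deque()  # (timestamp, failed) pairs, oldest first

    def record(self, failed):
        """Record one application event, e.g. a checkout attempt."""
        self.events.append((self.clock(), bool(failed)))
        self._expire()

    def _expire(self):
        # Drop events that have fallen out of the window.
        cutoff = self.clock() - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def failure_rate(self):
        """Return failures / total within the window, or 0.0 if there is no data."""
        self._expire()
        if not self.events:
            return 0.0
        failures = sum(1 for _, failed in self.events if failed)
        return failures / len(self.events)


def should_alert(rate, threshold=0.05):
    """Assumed policy: alert when the failure rate exceeds the threshold."""
    return rate > threshold
```

You would call `record()` wherever the business event happens and poll `should_alert(rate.failure_rate())` from whatever does your alerting. When an alert like this fires and the cause turns out to be lower in the stack, that is the cue to add the infra/system check that would have warned you earlier.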
What I have learned: take Munin (or your solution of choice) and install all of the plugins available for your infrastructure. It's hard to monitor too much, only too little.