| HN Mirror



	by riffraff 5643 days ago
	The problem is that you will probably think of the things you wish you had monitored only _after_ the crisis has happened; people are built in a way that makes it hard for us to think of what may go wrong. So another simple rule I have learned over time is to trust and understand the defaults, plugins, knobs, and metrics that come with well-known monitoring systems, even when your first reaction is "why the hell should I monitor _that_?". This way you use other people's experience as a backup for your own.

2 comments

ojilles 5643 days ago

How about starting with the application/business metrics, as those are presumably easier to articulate? Then, as things fail over time, move down the stack (infra/system) to get earlier warnings.


Uchikoma 5643 days ago

What I have learned: take Munin (or your solution of choice) and install all the plugins for the infrastructure you use. It's hard to monitor too much, only too little.
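For anything the bundled plugins do not cover, such as an application-level metric, a Munin plugin is just an executable that prints graph settings when called with `config` and `<field>.value <number>` lines otherwise. Here is a minimal sketch of that shape, not a definitive implementation: the `orders` field, the `business` category, and `read_orders_per_minute()` are hypothetical placeholders, not anything from this thread or from Munin itself.

```python
#!/usr/bin/env python3
"""Sketch of a Munin-style plugin for one application metric."""
import sys


def read_orders_per_minute():
    # Hypothetical placeholder: replace with a real query against your app.
    return 0


def config_lines():
    # Printed when the plugin is run with the "config" argument.
    # The field name "orders" and category "business" are assumptions.
    return [
        "graph_title Orders per minute",
        "graph_vlabel orders",
        "graph_category business",
        "orders.label orders",
    ]


def value_lines(orders):
    # Printed on a normal run: one "<field>.value <number>" line per field.
    return [f"orders.value {orders}"]


def main(argv):
    if len(argv) > 1 and argv[1] == "config":
        lines = config_lines()
    else:
        lines = value_lines(read_orders_per_minute())
    print("\n".join(lines))


if __name__ == "__main__":
    main(sys.argv)
```

Running the script by hand, once with `config` and once without, is a quick way to check the output before handing it to the monitoring node.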
