|
Nagios is excellent once you get comfortable with it. People have written check scripts for all sorts of bizarre hardware and interfaces (RS-232 polling, X10, etc.). You can make the babysitting proactive by writing event handlers: e.g., when a MySQL slave falls out of sync or memcached wedges, the event handler notices, kicks it, and if it doesn't recover after a few tries, THEN you get a page. Obviously you need to be careful about what you put in an event handler, but used judiciously, they're great. Cacti, Ganglia, and so forth are useful adjuncts for routers, clusters, and the like, but Nagios is a time-tested warhorse with a lot of community support. It's ugly, but it works and works well. PXE + Cfengine + Nagios can be pulled together for Real Ultimate Power, or at least a simulacrum of what goes on at places like Google. (Their scripts are custom Python monstrosities for the most part, but the functionality is pretty similar: several levels and types of babysitter systems funnel into a more proactive, statistical resource-allocation analysis framework, since the scale is so vast. At least that's the track development was on when I left, and Urs is still in charge, so I doubt it's swerved much.) See this article on adaptive state monitoring with Nagios and Cfengine: http://www.onlamp.com/pub/a/onlamp/2006/05/25/self-healing-n... Add something like PXE reimaging of dead nodes to the mix and you cut workloads by an order of magnitude on large installations. If you are a sysadmin and haven't experimented with self-healing systems, you should fix that gap in your skill set. If you have a sysadmin who can't or won't, fire him. |
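To make the "kick it, then page" idea concrete, here is a rough sketch of what such an event handler might look like. It assumes Nagios is configured to pass the `$SERVICESTATE$`, `$SERVICESTATETYPE$`, and `$SERVICEATTEMPT$` macros as the three command-line arguments. The restart command, the `MAX_KICKS` cap, and the function names are all made up for illustration; this is one possible policy, not the canonical way to do it.

```python
#!/usr/bin/env python3
"""Sketch of a Nagios-style event handler: kick a wedged service while the
problem is still SOFT, and leave HARD failures to normal notifications."""
import subprocess
import sys

# Hypothetical restart command; substitute whatever "kicks" your service.
KICK_CMD = ["service", "memcached", "restart"]
# Assumed policy: stop kicking after this many failed check attempts.
MAX_KICKS = 3


def should_kick(state, state_type, attempt):
    """Return True if the handler should try to restart the service."""
    if state != "CRITICAL":
        return False  # OK, WARNING, UNKNOWN: nothing to fix
    if state_type != "SOFT":
        return False  # HARD state: retries are exhausted, let Nagios page a human
    return attempt <= MAX_KICKS


def handle_event(state, state_type, attempt, run=subprocess.call):
    """Kick the service if the policy says so; return an exit status.

    `run` is injectable so the policy can be exercised without
    actually restarting anything.
    """
    if should_kick(state, state_type, attempt):
        return run(KICK_CMD)
    return 0


# Only act when invoked with the three expected arguments.
if __name__ == "__main__" and len(sys.argv) == 4:
    state, state_type, attempt = sys.argv[1:]
    sys.exit(handle_event(state, state_type, int(attempt)))
```

The point of splitting out `should_kick` is the "be careful what you put in an event handler" part: the decision logic stays tiny and testable, and the handler does nothing at all once the problem goes HARD, so the page still goes out if the kicks don't work.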