by dazzawazza
6745 days ago
I've just started testing Nagios (http://www.nagios.org/). It looks like complete overkill for a single server, and I don't know yet how useful it is, but it does look promising. It certainly looks like it would scale to tens if not hundreds of machines easily. It includes an alert structure, so different events trigger different actions: if the database stops responding, email the DBA; if it's a router, email the network admin; and so on. Again, I can't vouch for it over the long term as it's only been a week or so of testing, but I can't complain atm. It's a PITA field to research, and I'm trying to avoid the 'roll my own' urges, as I'd quite like to write it ;) Anyone else got any ideas? There is a Python-based monitoring application out there somewhere that I stumbled upon about 6 months ago, with a great plugin API and neato graphs, but I can't find it again :( I blame Google and not my incompetence :)
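To make the "different events, different actions" idea concrete, here is a rough Python sketch of that kind of routing. This is not how Nagios is configured (it uses its own contact and notification definitions); the `Event` class, the `ROUTES` table, and the example.com addresses are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """A hypothetical monitoring event (not a Nagios data structure)."""
    host: str
    kind: str   # e.g. "database" or "router"
    state: str  # "OK", "WARNING" or "CRITICAL"


# Hypothetical routing table: which contact gets mail for which kind of event.
ROUTES = {
    "database": "dba@example.com",
    "router": "netadmin@example.com",
}
DEFAULT_CONTACT = "oncall@example.com"  # assumed fallback for unknown kinds


def route_alert(event: Event) -> Optional[str]:
    """Return the address to notify, or None if nothing is wrong."""
    if event.state == "OK":
        return None
    return ROUTES.get(event.kind, DEFAULT_CONTACT)
```

So `route_alert(Event("db1", "database", "CRITICAL"))` would pick the DBA, and anything without a route falls through to whoever is on call.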
Cacti, Ganglia, and so forth are useful adjuncts for routers, clusters, and the like. But Nagios is a time-tested warhorse with a lot of community support. It's ugly, but it works and works well.
PXE + Cfengine + Nagios can be pulled together for Real Ultimate Power, or at least a simulacrum of what goes on at places like Google. Their scripts are custom Python monstrosities for the most part, but the functionality is pretty similar: several levels and types of babysitter systems funnel into a more proactive, statistical resource-allocation analysis framework, since the scale is so vast. At least that's the track development was on when I left, and Urs is still in charge, so I doubt it's swerved much.
See here for an article on adaptive state monitoring with Nagios and Cfengine: http://www.onlamp.com/pub/a/onlamp/2006/05/25/self-healing-n... Add something like PXE reimaging of dead nodes to the mix and you cut down workloads by an order of magnitude on large installations.
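For anyone who hasn't tried this, the self-healing part usually hangs off a Nagios event handler: Nagios runs a command when a service changes state and can pass it the `$SERVICESTATE$`, `$SERVICESTATETYPE$`, and `$SERVICEATTEMPT$` macros as arguments. A minimal Python sketch of the decision logic follows. The restart command path and the "repair on the third soft failure" threshold are my assumptions, not something taken from the article, and a real setup might hand the repair to Cfengine instead.

```python
import subprocess

# Hypothetical repair command; substitute whatever restarts your service.
RESTART_CMD = ["/usr/local/bin/restart-myservice"]


def should_repair(state, state_type, attempt, soft_threshold=3):
    """Decide whether to attempt a repair.

    state:      value of $SERVICESTATE$ ("OK", "WARNING", "CRITICAL", ...)
    state_type: value of $SERVICESTATETYPE$ ("SOFT" or "HARD")
    attempt:    value of $SERVICEATTEMPT$ as an int
    soft_threshold is an assumed policy knob, not a Nagios setting.
    """
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    # Still retrying: only step in once the check has failed a few times.
    return state_type == "SOFT" and attempt >= soft_threshold


def handle_event(state, state_type, attempt, run=subprocess.run):
    """Run the repair command if warranted; return True if it was run."""
    if should_repair(state, state_type, int(attempt)):
        run(RESTART_CMD, check=False)
        return True
    return False
```

The `run` parameter is only there so the logic can be exercised without actually restarting anything; wired into Nagios, you would call `handle_event` with the three macro values from the command line.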
If you are a sysadmin and haven't experimented with self-healing systems, you should fix that gap in your skill set. If you have a sysadmin who can't or won't, fire him.