by blueskin_ 4485 days ago
>Have you ever been woken up by a nagios page that automatically cleared after five minutes because the incoming queue was delayed past the alert interval?

No, because I know how to configure escalations properly.

>Have you ever had your browser crash because you click on the wrong thing in the designed-in-1996-and-never-updated nagios interface and had your browser crash because it dumps 500MB of logs to your screen?

Actually, no. I've had my browser crash due to AJAX crap all the time though. Nagios' (and Icinga classic's) interface is clear, simple and logical; it's just not 2MB of worthless javascript that wastes half my CPU time, so I can see why it's unpopular with some user types.

>Have you ever had services wake you up with alert then clear then alert then clear again because some new intern configured a new monitor but didn't set up alerting correctly (because lol, they don't get paged, so who gives a flip if they copied and pasted the wrong template config, as is standard practice)?

No, because I know how to use time periods, and escalations again.

>Have you had to hire "nagios consultants" to figure out how to scale out your busted monitoring infrastructure because nagios was designed to run on a single core Pentium 90?

No, because it isn't, because I know the basics of Linux performance tuning, and because I've heard of Icinga and/or distributed Nagios/Icinga systems for very large scale.

Your post reads like "Have you ever crashed your car head on into a concrete wall at 70mph because it didn't brake for me?". No amount of handholding a program can do will protect users who have no clue how to use it.

I do not by any means consider myself an expert in Nagios either - if there was such a market for consultants as you claim, I'd likely be doing it and therefore be rich, but in actual fact, it's a skill just about any mid-level or better admin has.

I've inherited a messy Nagios config before, which I rebuilt from scratch in a maintainable way and then extended. If Nagios (or MySQL pre-Oracle, for that matter) has a problem, it's amateurs attempting it, making a mess, and others judging the quality of the tool on their sloppy work. That's not unique to Nagios, by any means. If there's a criticism you can level at Nagios for that, it's the lack of documentation and examples in the config files.

I'm also not denying the existence of alternatives - OpenNMS is ok, as is Zabbix, but both are far more limited in terms of available plugins and extensibility, and by nature harder to extend. Munin is good for out of the box graphing, but relatively poor for actual monitoring/alerting; it's hard to write new plugins for, and few additional plugins are available. Each one is a standalone tool that's good for a purpose, and not some vaguely defined set of programs, partly nonexistent, that everyone has to hack together for themselves.


Best approach to getting accustomed to Nagios is definitely setting it up for yourself. I used to support a mess of a Nagios server once, and on my new job, when they needed a good monitoring system I requested Nagios. Now we have our 100+ servers, 500+ switches and many other services monitored through it.

We had a college student write up a quick nagios add/del/modify app over his summer break. It took him a few hours to bring it up, and now it is easy to replace the whole configuration(s) through it.

Same here, it has never crashed on us, on the server or on the client. I don't know what this guy is talking about, maybe he's confusing it with the old OpenNMS or ZenOS?

not trolling - but how do you configure escalations properly in a circumstance where a queue might delay longer than any arbitrary period? In short - tell me your secrets

Well, why are you triggering on something that appears to have a completely random amount of delay? Either you choose your line in the sand, or you monitor the dependency that is causing the variability.

A typical nagios alert will fire if it hasn't been updated in X seconds. Sometimes the queue of incoming events gets backed up and nagios doesn't receive the results of service probes until X+5 seconds or 2X seconds later (due to internal nagios design problems, not the services actually being delayed).

So, nagios thinks "Service Moo hasn't contacted us in 60 seconds, ALERT!" when the update is actually in the event log, but nagios hasn't processed it yet.
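
To make that failure mode concrete, here is a toy model in Python. It is not how Nagios is implemented internally; the `PassiveResult` type, the 60-second threshold and the timestamps are all made up for illustration. It only shows how judging staleness by processing time can raise an alert that clears on its own, even though the service reported on schedule.

```python
from dataclasses import dataclass


@dataclass
class PassiveResult:
    sent_at: int       # second at which the service actually reported (hypothetical)
    processed_at: int  # second at which the monitor read the result from its queue


def stale_seconds(timestamps, threshold, horizon):
    """Return every second in [0, horizon] at which the newest timestamp
    that is not in the future is more than `threshold` seconds old."""
    alerts = []
    for now in range(horizon + 1):
        seen = [t for t in timestamps if t <= now]
        last_seen = max(seen, default=0)
        if now - last_seen > threshold:
            alerts.append(now)
    return alerts


# Assumed scenario: the service reports every 30 seconds, but the results
# sent at 60 and 90 sit in a backed-up queue until seconds 100 and 101.
results = [
    PassiveResult(sent_at=0, processed_at=0),
    PassiveResult(sent_at=30, processed_at=30),
    PassiveResult(sent_at=60, processed_at=100),
    PassiveResult(sent_at=90, processed_at=101),
    PassiveResult(sent_at=120, processed_at=120),
]

# What the monitor believes, based on when it processed each result.
monitor_alerts = stale_seconds([r.processed_at for r in results], 60, 120)
# What actually happened, based on when the service sent each result.
actual_alerts = stale_seconds([r.sent_at for r in results], 60, 120)

print("monitor sees stale service at seconds:", monitor_alerts)
print("service was really stale at seconds:", actual_alerts)
```

In this made-up run the monitor flags the service as stale for a few seconds and then clears once the queue drains, while the ground-truth list stays empty.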

I haven't seen this in ~1k services, but I guess it probably depends on the spec of the monitoring system to some degree, and I realise that 1k+ hosts is likely a different story. If you're using passive checks in any high-rate capacity, you should be using NSCA or increasing the frequency they are read in anyway. This is also another problem Icinga handles better - while I say Nagios for convenience's sake, my comments here refer to Icinga (and to Nagios XI, which is comparable but stupidly expensive).
No matter how long the delay until alerting is, there can always be the possibility of a service that stays critical for $time + 1; it's just that delaying the alert by a minute or two makes no appreciable difference for most circumstances (if it does, you should have 24/7 staff anyway) and filters out services briefly dropping out and immediately coming back, e.g. a service or host restart that happened to be caught at the wrong time.

That, and setting up proper retry intervals for checks that take a long time to execute.
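
As a rough illustration of why delaying the alert filters out brief dropouts, here is a small Python sketch. It is only a simplified model of the idea (notify once a check has failed N times in a row), not the actual Nagios/Icinga state machine; the function name and the sample data are hypothetical.

```python
def notification_points(check_results, required_failures):
    """Return the indexes of the checks at which a notification would go out.

    check_results: list of booleans, True meaning the check passed.
    A notification is sent only when the run of consecutive failures
    first reaches `required_failures`; any passing check resets the run.
    """
    streak = 0
    notified = []
    for index, ok in enumerate(check_results):
        if ok:
            streak = 0
        else:
            streak += 1
            if streak == required_failures:
                notified.append(index)
    return notified


# Assumed sample: one blip (e.g. a restart caught at the wrong moment),
# followed later by a real outage lasting four checks.
checks = [True, False, True, True, False, False, False, False, True]

print("alert on first failure:", notification_points(checks, 1))
print("alert after 3 failures:", notification_points(checks, 3))
```

With a threshold of one, both the blip and the outage page someone; with a threshold of three, only the sustained outage does, at the cost of hearing about it two checks later.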