by danfromberlin 3481 days ago
Hi dozzie [edit: typo],

I can entirely sympathize with your comment here, but I'd like to try to quickly present a contrary perspective:

Indeed, every monitoring system makes architectural trade-offs, and icinga 2 makes some that we have also found frustrating. One example that comes to mind for me personally is that icinga aims to maintain backwards compatibility with its historically-derived nagios configuration file syntax which is difficult to understand and hard to parse in an automated fashion.

On the other hand, there are examples of architectural choices that I believe icinga gets right: It implements an approach to secure and authenticated metrics collection that virtually every other monitoring system leaves as an "exercise" for the user. It provides checks and alerts and notification thresholds by default, which many other monitoring systems don't.

We build monitor in a box to highlight one particular approach to using icinga 2 which we find works for us. We attempt to be systematic and thorough in our approach, aiming for a reproducible, Ansible-based implementation that emphasizes modularity and code reuse. We invite everyone to try out our open source offering and decide for themselves whether the benefits of running icinga 2 in this fashion outweigh the drawbacks.

1 comments

> Hi dozie,

It's two "z" there.

> [...] icinga aims to maintain backwards compatibility with its historically-derived nagios configuration file syntax which is difficult to understand and hard to parse in an automated fashion.

Parsing an Icinga/Nagios configuration file is easy, even if you count object templates (register=0 and use foo) and use a handcrafted parser instead of a generated one. The syntax is not complicated. I don't know what problems you have encountered.
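
To make that concrete, here is a rough sketch of such a handcrafted parser in Python. It assumes the common "define <type> {" layout with the brace on the same line, '#' and ';' comments, and 'use' / 'name' / 'register' for templates. The function names are made up for illustration and it ignores the rarer corners of the syntax (line continuations, additive inheritance and so on).

```python
import re

# Matches "define host {" as well as "define host{".
DEFINE_RE = re.compile(r"^define\s+(\w+)\s*\{?$")


def parse_objects(text):
    """Parse 'define <type> { key value ... }' blocks into a list of dicts."""
    objects, current = [], None
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop inline ';' comments
        if not line or line.startswith("#"):
            continue
        if current is None:
            match = DEFINE_RE.match(line)
            if match:
                current = {"_type": match.group(1)}
        elif line == "}":
            objects.append(current)
            current = None
        else:
            parts = line.split(None, 1)  # directive name, then the rest
            current[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return objects


def resolve(objects):
    """Expand 'use' templates and drop objects marked 'register 0'."""
    templates = {(o["_type"], o["name"]): o for o in objects if "name" in o}

    def expand(obj, seen=frozenset()):
        merged = {}
        parents = [p.strip() for p in obj.get("use", "").split(",") if p.strip()]
        # Apply later templates first so the first one listed wins.
        for parent in reversed(parents):
            key = (obj["_type"], parent)
            if key in seen:
                raise ValueError(f"template loop at {parent!r}")
            merged.update(expand(templates[key], seen | {key}))
        merged.update(obj)  # the object's own directives override templates
        return merged

    resolved = []
    for obj in objects:
        if obj.get("register", "1") == "0":
            continue  # template only, not a real object
        full = expand(obj)
        for key in ("use", "name", "register"):
            full.pop(key, None)
        resolved.append(full)
    return resolved
```

Calling resolve(parse_objects(text)) on the contents of a config file gives one flat dict per registered object, with the template directives already merged in.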

> On the other hand, there are examples of architectural choices that I believe icinga gets right: It implements an approach to secure and authenticated metrics collection that virtually every other monitoring system leaves as an "exercise" for the user.

Oh, this is a more or less easy task if your monitoring system has some secret (e.g. an X.509 certificate) exchanged with the monitored hosts, and it can be bolted onto pretty much any monitoring system with some stunnel-fu (which proves that it has nothing to do with the architecture of the system).

It's sharing that secret in a robust and automatic way that is quite difficult. I doubt Icinga does anything better than the rest of the crowd.

> It provides checks and alerts and notification thresholds by default, which many other monitoring systems don't.

Once you have an established flow of monitoring messages, then thresholds, alerts, and notifications become simple stream processing and consumption. Sure, Icinga and others give you this simple processing in the package, and some systems give you a few more queries than others (e.g. originally Nagios only processed what I call "state", while Cacti only processed metrics, and Zabbix processes both). But this processing is rarely complex. And none of them gives you the ability to process the data stream itself.
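
As an illustration of how little is involved, thresholds and notifications over a message stream could be sketched in Python like this. The Sample type, the state names and the function names are my own assumptions for the example, not anything taken from Icinga or Nagios.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    host: str
    metric: str
    value: float


def to_states(samples, warn, crit):
    """Turn a stream of metric samples into a stream of (sample, state)."""
    for sample in samples:
        if sample.value >= crit:
            state = "CRITICAL"
        elif sample.value >= warn:
            state = "WARNING"
        else:
            state = "OK"
        yield sample, state


def state_changes(states):
    """Yield only transitions; these are what a notification would carry."""
    last = {}
    for sample, state in states:
        key = (sample.host, sample.metric)
        previous = last.get(key, "OK")  # unseen series are assumed to be OK
        if previous != state:
            yield sample.host, sample.metric, previous, state
        last[key] = state
```

Both stages are plain generators, so they can be chained over any iterable of samples, and more stages (rate limiting, escalation) would slot in the same way.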

And then there is this almost universally shared requirement that you need to define all the instances of hosts and services beforehand, differing only in how the template system is implemented in a given monitoring system.

You can't just start collecting data about servers as they get installed and about services and resources as they emerge and disappear. No, you need to tell the monitoring system that it should expect data from this host and this service (collectd, Graphite, and InfluxDB got it right here).

It is sometimes useful for a monitoring system to expect some data to show up (and possibly alert that it's missing), but only sometimes, not all the time. Usually other data can easily cover the scenario where something silently goes down, and there's still this "stream processing" thing I mentioned that can just monitor that some data stopped being received.
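
A minimal sketch of that last idea, in Python: sources register themselves simply by sending data, and anything that has gone quiet for too long is reported. The class and method names are invented for the example.

```python
class Heartbeats:
    """Track when each source was last heard from; nothing is predefined."""

    def __init__(self, max_age):
        self.max_age = max_age  # seconds of silence tolerated
        self.last_seen = {}     # source name -> latest timestamp seen

    def observe(self, source, timestamp):
        """Record a message; a source exists from its first message on."""
        previous = self.last_seen.get(source, timestamp)
        self.last_seen[source] = max(previous, timestamp)

    def stale(self, now):
        """Return the sources that have been silent longer than max_age."""
        return sorted(
            source
            for source, seen in self.last_seen.items()
            if now - seen > self.max_age
        )
```

Call observe() for every message that arrives and poll stale() on a timer; the result can be fed into the same alerting path as any other state.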

> You can't just start collecting data about servers as they get installed and about services and resources as they emerge and disappear.

You can do that in Zabbix. It has two functions to discover hosts (called "network discovery") or "features" like running services or e.g. switch ports (called "low level discovery") and apply templates based on certain query results (open ports, SNMP values etc.). Alerting on the lack of new incoming data is also possible. I used Zabbix a lot until a year ago and liked it much more than Nagios+descendants.

A discovery mechanism like this may be nice in some situations, but it doesn't change the underlying architectural problem: you still need to go through the central configuration of the monitoring system to get something watched.

This architecture results in a much longer feedback loop, [object to be monitored -> discovery -> config -> collection engine -> probe -> object's state -> storage] vs. [object to be monitored -> probe -> object's state -> storage]. And you have to move requests/network packets back and forth just to collect the usage for each CPU on a machine separately (or NIC usage, or filesystem usage, or VM/LXC guests, or Apache/nginx vhosts, or...), because the knowledge of what objects are there to monitor is on the monitored host's side, not on the monitoring system's side.

And you're limited to whatever resources this predefined discovery mechanism supports. You can't easily write your own probe that discovers things on its own.
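
For contrast, a probe that discovers its own objects can be very small. This Python sketch assumes a Linux host and the usual /proc/net/dev layout (two header lines, then one "name: counters..." line per interface); the function names and the metric naming scheme are made up for the example.

```python
def discover_nic_metrics(proc_net_dev_text):
    """Return {metric_name: value} for every interface found in the text."""
    metrics = {}
    for line in proc_net_dev_text.splitlines()[2:]:  # skip the two headers
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) < 16:
            continue  # not an interface line
        nic = name.strip()
        metrics[f"net.{nic}.rx_bytes"] = int(fields[0])  # first receive column
        metrics[f"net.{nic}.tx_bytes"] = int(fields[8])  # first transmit column
    return metrics


def read_local_nic_metrics(path="/proc/net/dev"):
    """Read the local interface counters (Linux only)."""
    with open(path) as handle:
        return discover_nic_metrics(handle.read())
```

A new interface shows up in the output the moment it exists on the host, without anyone touching a central config.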

And on top of that, a discovery mechanism doesn't allow temporary, ad-hoc defined things to be monitored. You can't imagine how often I set up a screen with a shell loop to collect a metric or watch state for this one particular process I was debugging:

  while sleep 1; do foo; done | tee /tmp/foo.log

Why not chart that from a monitoring system? Why not display its state on a dashboard? Why not alert in the regular way?
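
With a push-based store this takes very little. The Python sketch below assumes foo prints one number per line and that a Graphite carbon listener accepts the plaintext protocol ("path value timestamp" per line) on localhost port 2003; the metric path and the function names are placeholders.

```python
import socket
import sys
import time


def graphite_line(path, value, timestamp):
    """Format one sample in Graphite's plaintext protocol."""
    return f"{path} {float(value)} {int(timestamp)}\n"


def forward(lines, path, send, clock=time.time):
    """Send every numeric input line as a sample; return how many were sent."""
    sent = 0
    for raw in lines:
        try:
            value = float(raw.strip())
        except ValueError:
            continue  # ignore anything that is not a plain number
        send(graphite_line(path, value, clock()))
        sent += 1
    return sent


def main(argv):
    """Read numbers from stdin and push them to the assumed carbon listener."""
    path = argv[1] if len(argv) > 1 else "adhoc.foo"
    with socket.create_connection(("localhost", 2003)) as sock:
        forward(sys.stdin, path, lambda line: sock.sendall(line.encode("ascii")))
```

Saved as, say, adhoc_metric.py with main(sys.argv) called at the bottom, it would just be one more stage in the pipe: ... | tee /tmp/foo.log | python3 adhoc_metric.py debug.foo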