by sszuecs 3619 days ago
In the past we used Icinga at Zalando, and it scaled for us to 40k checks; after that we got huge latency problems. We now use ZMON (https://github.com/zalando/zmon/), which is really great because it scales the checks. The time-series database behind the graphs is KairosDB on top of Cassandra, which also scales. Creating alerts can be automated, development teams can add alerts themselves, and you can easily build team dashboards, reuse checks and alerts, and filter them to your entities. InfluxDB was a nice try, but clustering was very unstable in the beginning (we tried 0.7 and 0.8). If you don't want to be the monitoring configurator for your whole organization (application monitoring also has to be created and maintained), I highly recommend ZMON (maybe Prometheus can also help). ZMON also has a check that queries Prometheus.