| > Short answer: Prometheus + Grafana + Alertmanager. Or, a higher-level recommendation, appropriate for most SMBs: sign up for Grafana Cloud's managed prometheus+grafana (or any equivalent external managed monitoring stack), and then follow their setup instructions to install their grafana-agent monitoring agent package (which sticks together node_exporter, several other optional exporters enable-able with config stanzas (e.g. redis_exporter, postgresql_exporter, etc.), and a log multiplexer for their Loki logging service [which is to logs-based metrics as Prometheus is to regular time-series metrics.]) Why use a managed service? Because, unless your IT department is large enough to have its own softball team, the stability/fault-tolerance of the "monitoring and alerting infra" itself is going to be rather low-priority for the company compared to other things you're being asked to manage; and also will be something you rarely need to touch... until it breaks. Which it will. You really want some other IT department whose whole job is to just make sure your monitoring and alerting stay up, doing this as their product IT focus rather than their operations IT focus. (You also want your alerting to be running on separate infra from the thing it watches, for the same reason that your status page should be on a separate domain and infra from the system it reports the status of. Having some other company own it is an easy way to achieve this.) > Regarding the type of alert itself, I send myself mails for the persistence/reminders + Telegram messages for the instant notifications. Again, higher-level rec appropriate for SMBs: sign up for PagerDuty, and configure it as the alert "Notification Channel" in Grafana Cloud. If you're an "ops team of one", their free plan will work fine for you. Why is this better than Telegram messages? Because the PagerDuty app does "critical alerts" — i.e. its notifications pierce your phone's silent/do-not-disturb settings (and you can configure them to be really shrill and annoying.) You don't want people to be able to call you at 2AM — but you do want to be woken up if all your servers are on fire. --- Also: if you're on a cloud provider like AWS/GCP/etc, it can be tempting to rely on their home-grown metrics + logging + alerting systems. Which works, right up until you grow enough to want to move to a hybrid "elastic load via cloud instances; base load via dedicated hardware leasing" architecture. At which point you suddenly have instances your "home cloud" refuses to allow you to install its monitoring agent on. Better to avoid this problem from the start, if you can sense you'll ever be going that way. (But if you know your systems aren't "scaling forever" and you'll stay in the cloud, the home-grown cloud monitoring + alerting systems are fine for what they do.) |
I must have spent about a week trying to learn just enough about prometheus and grafana (I had used grafana before with influx but for a different purpose) so that we could monitor temperature, memory, cpu, and disk (the bare minimum).
The goal was to have a single dashboard showing these critical metrics for all servers (< 100), and be able to receive email or sms alerts when things turned red.
No luck. After a week I had nothing to show for.
So I turned to Netdata. A one liner on each server and we had super sexy and fast dashboard for each server. No birds eye view, but fine. I then spent maybe 3-4 days trying to figure out how to get alerting to work (just email, but fine) and get temperature readings (or something like that).
No luck. By the end of week 2 I still had nothing, but a bunch of servers shutting down during peak hours.
Week 3 I said fuck it I'll do the stupidest thing and write my own stack. A bunch of shell scripts, deployed via ansible, capturing any metric I could think of, managed by systemd, posting to a $5/month server running a single nodejs service that would do in memory (only) averages, medians etc, and trigger alerts (email, sms, Slack maybe soon) when things get yellow or red.
By week 4 we had monitoring for all servers and for any metric we really needed.
Super cheap, super stable and absolutely no maintenance required. Sure, we probably can't monitor hundreds of servers or thousands of metrics, but we don't need to.
I really wanted to use something else, but I just couldn't :(