by gregmac 1057 days ago
I don't monitor services at that level at all, because it means basically nothing. More to the point: the lack of a notification doesn't mean everything is "ok".

I tend to monitor the actual service. If it's a web server, have something checking that a specific URL is working (tip: use something specific, not /). Likewise any other network service is pretty easy to monitor.
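To make the "specific URL" tip concrete, here is a minimal sketch in Python using only the standard library. The function names and the idea of matching on an expected piece of page text are my own assumptions for illustration, not something any particular monitoring tool prescribes.

```python
import urllib.error
import urllib.request


def response_is_healthy(status: int, body: str, expected_text: str) -> bool:
    """Healthy means HTTP 200 *and* the page contains the text we expect.

    Checking the content guards against a default or error page that
    still answers with a 200.
    """
    return status == 200 and expected_text in body


def check_url(url: str, expected_text: str, timeout: float = 5.0) -> bool:
    """Fetch one specific URL and verify both status and content."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return response_is_healthy(resp.status, body, expected_text)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, HTTP error status, ...
        return False
```

You'd call it as something like `check_url("https://example.com/some/page", "text that only the real page has")`, where both the URL and the text are placeholders for whatever is specific to your service.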

For backups, check the date on the most recent file in the backup target location. If that date is older than "x", something is broken. This can apply to most other types of backend apps too -- everything has some kind of output.
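A rough sketch of that backup check in Python, assuming the backups land as plain files in a single directory. The function names and the choice of seconds for "x" are hypothetical.

```python
import time
from pathlib import Path
from typing import Optional


def newest_mtime(backup_dir: str) -> Optional[float]:
    """Return the modification time of the newest file, or None if there is none."""
    try:
        files = [p for p in Path(backup_dir).iterdir() if p.is_file()]
    except OSError:
        # Directory missing or unreadable: treat it the same as "no backups".
        return None
    if not files:
        return None
    return max(p.stat().st_mtime for p in files)


def backup_is_fresh(backup_dir: str, max_age_seconds: float,
                    now: Optional[float] = None) -> bool:
    """True if the newest file in backup_dir is at most max_age_seconds old."""
    now = time.time() if now is None else now
    newest = newest_mtime(backup_dir)
    if newest is None:
        return False  # no backup at all counts as broken
    return (now - newest) <= max_age_seconds
```

Note that an empty or missing target directory is reported as a failure, which is the point: no output means something is broken, whatever the cause.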

It's when these checks fail that you can investigate deeper and start diagnosing systemd or whatever. It's also possible there's a bigger problem -- like DNS got messed up, or the hardware died -- and checking the final outcome will catch all this.

Basically, explicitly checking systemd is a lot of extra work for no real added benefit. If your systemd service is failing often enough that knowing immediately (at the alert level) that it is the problem matters, IMHO you'd be better off spending the time fixing the service definition so it doesn't fail.

1 comments

As a counterpoint - we do both, but these two things have different goals.

Checks from the point of view of an end user are the gold standard for whether the service is functioning, and functioning well enough. I very much agree with this. For example, in the case of postgres, a sharp increase or decrease in query throughput or query duration is something to alert on, because it will negatively impact the applications depending on it.
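As a very rough illustration of "sharp increase or decrease", here is a Python sketch that compares a current value against a baseline. The function name and the factor of 2 are arbitrary assumptions for the example; a real threshold depends on the workload.

```python
def sharp_change(baseline: float, current: float, max_ratio: float = 2.0) -> bool:
    """True if current is more than max_ratio times above or below baseline.

    Works the same for throughput (queries/s) or duration (ms); the units
    just have to match between baseline and current.
    """
    if baseline <= 0:
        # No meaningful baseline: any activity at all counts as a change.
        return current > 0
    return current > baseline * max_ratio or current < baseline / max_ratio
```

For example, with a baseline of 100 queries/s, both 250 and 40 would be flagged, while 120 would not.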

However, we have incrementally implemented additional checks and dependencies between checks to speed up troubleshooting complex systems during an emergency. Instead of on-call having to, e.g., check postgres, check patroni, check consul, check consul server cluster, go back, check network, check certificates... zabbix can already compile this into a statement like "postgres is down, but that is caused by patroni not reaching the DCS, but that's caused by the consul client being down.. however, the service is running and the certificates are fine and the consul-server cluster is also fine".
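To show the idea of dependencies between checks, here is a toy Python sketch. This is not how zabbix implements it and not its API; the dependency map, the check names and the functions are all made up for illustration, and it only follows the first failing dependency at each step.

```python
from typing import Dict, List

# Hypothetical dependency map: each check lists the checks it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "postgres": ["patroni"],
    "patroni": ["consul-client"],
    "consul-client": ["consul-server-cluster", "certificates", "network"],
}


def failure_chain(check: str, status: Dict[str, bool],
                  depends_on: Dict[str, List[str]]) -> List[str]:
    """Follow failing dependencies from `check` down to the deepest failing one.

    status maps check name -> True (ok) / False (failing).
    Returns [] if `check` itself is ok.
    """
    chain: List[str] = []
    seen = set()
    current = check
    while current is not None and not status[current] and current not in seen:
        chain.append(current)
        seen.add(current)  # guards against cycles in the map
        failing = [d for d in depends_on.get(current, []) if not status[d]]
        current = failing[0] if failing else None
    return chain


def healthy_dependencies(check: str, status: Dict[str, bool],
                         depends_on: Dict[str, List[str]]) -> List[str]:
    """Dependencies of `check` that are fine, i.e. things on-call can skip."""
    return [d for d in depends_on.get(check, []) if status[d]]


# Example state matching the scenario described above.
status = {
    "postgres": False,
    "patroni": False,
    "consul-client": False,
    "consul-server-cluster": True,
    "certificates": True,
    "network": True,
}
chain = failure_chain("postgres", status, DEPENDS_ON)
ruled_out = healthy_dependencies(chain[-1], status, DEPENDS_ON)
```

Here `chain` ends at the consul client as the likely root cause, and `ruled_out` lists the things already known to be fine, which is the part that saves on-call from walking the stack by hand.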