
As a counterpoint - we do both, but these two things have different goals.

Checks from the end user's point of view are the gold standard for telling whether the service is functioning, and functioning well enough. I very much agree with this. With postgres, for example, a sharp increase or decrease in query throughput or query duration is worth alerting on, because it will hurt the applications that depend on it.
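To make "sharp increase or decrease" concrete, here is a rough sketch of such a check. It is not our actual alerting rule: the function name `sharp_change`, the median baseline, and the factor of 2 are all assumptions picked for illustration.

```python
from statistics import median


def sharp_change(baseline, current, factor=2.0):
    """Classify `current` against recent samples of the same metric.

    baseline: recent samples, e.g. queries per second (hypothetical input).
    factor:   assumed threshold; 2.0 means "doubled or halved".
    Returns "increase", "decrease", or None.
    """
    if not baseline:
        return None  # nothing to compare against yet
    typical = median(baseline)  # median is robust to a single outlier
    if typical <= 0:
        # Baseline was idle: any traffic at all counts as an increase.
        return "increase" if current > 0 else None
    ratio = current / typical
    if ratio >= factor:
        return "increase"
    if ratio <= 1 / factor:
        return "decrease"
    return None


recent_qps = [100, 110, 90, 105]
print(sharp_change(recent_qps, 400))
print(sharp_change(recent_qps, 30))
print(sharp_change(recent_qps, 120))
```

The same comparison works for query durations. A real setup would likely also want a minimum duration before firing, so a single odd sample does not page anyone.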

However, we have incrementally added further checks, and dependencies between checks, to speed up troubleshooting of complex systems during an emergency. Instead of on-call having to check postgres, check patroni, check consul, check the consul server cluster, go back, check the network, check certificates, and so on, zabbix can already compile this into a statement like "postgres is down, but that is caused by patroni not reaching the DCS, and that in turn is caused by the consul client being down. However, the service is running, the certificates are fine, and the consul-server cluster is also fine".
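The idea behind such a statement can be sketched in a few lines. To be clear, this is not how zabbix implements it and not our configuration: the check names, the dependency map, and the `explain` helper are hypothetical, and the topology is only my simplified reading of the example above.

```python
def explain(check, status, depends_on):
    """Follow failing dependencies from `check` down to the likely root cause.

    status:     check name -> True if healthy, False if failing.
    depends_on: check name -> list of checks it depends on.
    Returns (chain of failing checks, healthy checks seen on the way).
    """
    chain = [check]
    healthy = []
    seen = {check}  # guards against cycles in the dependency map
    current = check
    while True:
        failing_dep = None
        for dep in depends_on.get(current, []):
            if status[dep]:
                if dep not in healthy:
                    healthy.append(dep)  # ruled out as a cause
            elif failing_dep is None and dep not in seen:
                failing_dep = dep  # follow the first failing dependency
        if failing_dep is None:
            break  # nothing below is failing: current is the root cause
        chain.append(failing_dep)
        seen.add(failing_dep)
        current = failing_dep
    return chain, healthy


# Hypothetical topology, loosely modelled on the example above.
depends_on = {
    "postgres": ["patroni"],
    "patroni": ["consul-client", "certificates"],
    "consul-client": ["consul-server-cluster"],
}
status = {
    "postgres": False,
    "patroni": False,
    "consul-client": False,
    "certificates": True,
    "consul-server-cluster": True,
}

chain, healthy = explain("postgres", status, depends_on)
print(" <- caused by ".join(chain))
print("fine:", ", ".join(healthy))
```

This only follows the first failing dependency at each level. With several simultaneous failures you would want to report every failing branch, but the single chain is enough to show why the compiled statement saves on-call the manual walk.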