by mdip 1243 days ago
Which means that if monitoring and status pages were required to be connected, one of two things happens for each monitored component:

(1) The monitoring system gets altered to ignore tests that return false positives (at the expense of missing the alert when it does represent an outage).

(2) The monitoring gets fixed. It wasn't working for the sysadmins/operators anyway: it had so many false positives that their "mental model" was already essentially (1).

At least, where I've forced the issue by doing just this, that's exactly what happened. Since SLAs took a hit and that affected bonus payouts, monitoring got a lot better. So did overall team function: once we truly realized how bad things were, we stopped doing workarounds and started fixing problems at a more fundamental level, which led to SLAs that were both accurate and excellent.

It brought attention to a hidden problem, which resulted in time being allocated to fix tests that constantly threw false positives and to evaluate whether each one should exist in the first place.
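
For anyone wanting to do the same kind of triage, here is a rough sketch of how the per-test evaluation could be scripted. This is not what we actually ran: the function names, the check names, the alert-history format (a check name plus whether the alert turned out to be a real outage), and the 0.5 threshold are all made up for illustration.

```python
from collections import defaultdict


def false_positive_rates(alerts):
    """Return {check_name: fraction of its alerts that were false positives}.

    `alerts` is an iterable of (check_name, was_real_outage) pairs.
    This input format is an assumption made for the example.
    """
    fired = defaultdict(int)   # total alerts per check
    false = defaultdict(int)   # alerts that did not correspond to an outage
    for check, was_real in alerts:
        fired[check] += 1
        if not was_real:
            false[check] += 1
    return {check: false[check] / fired[check] for check in fired}


def checks_to_review(alerts, threshold=0.5):
    """List checks whose false-positive rate is at or above `threshold`.

    These are candidates to fix or delete, not to silently ignore.
    The default threshold is arbitrary.
    """
    rates = false_positive_rates(alerts)
    return sorted(check for check, rate in rates.items() if rate >= threshold)


# Hypothetical alert history: (check name, was it a real outage?)
history = [
    ("disk_space", True),
    ("ping_gateway", False),
    ("ping_gateway", False),
    ("ping_gateway", True),
    ("http_login", True),
]

print(checks_to_review(history))
```

The point of listing the noisy checks rather than auto-suppressing them is that suppression is just option (1) again; each flagged check still needs a human decision to fix it or remove it.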