Hacker News
by chronid 1816 days ago
My interpretation is that it's advocating for counting the things that matter, not the consequence (e.g., the "SEV" event).

The problem is your availability being <X%, or your API taking >Xs to respond Y% of the time, not "we had X SEVs last half". People will gravitate to the numbers that get talked about (because they at least appear important), so you should try to talk about the right ones.
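To make that concrete, here's a rough sketch of computing those two numbers straight from request records instead of counting incidents. The `Request` shape, the function names, and the sample data are all made up for illustration, not taken from any real monitoring setup.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool          # True if the request succeeded (hypothetical field)
    latency_s: float  # response time in seconds (hypothetical field)


def availability(requests):
    """Fraction of requests that succeeded, from 0.0 to 1.0."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully available
    return sum(r.ok for r in requests) / len(requests)


def slow_fraction(requests, threshold_s):
    """Fraction of requests that took longer than threshold_s seconds."""
    if not requests:
        return 0.0
    return sum(r.latency_s > threshold_s for r in requests) / len(requests)


# Made-up sample: four requests, one failure, two slower than 0.5s.
sample = [
    Request(ok=True, latency_s=0.12),
    Request(ok=True, latency_s=0.95),
    Request(ok=False, latency_s=2.40),
    Request(ok=True, latency_s=0.30),
]

print(f"availability: {availability(sample):.0%}")
print(f"slower than 0.5s: {slow_fraction(sample, 0.5):.0%}")
```

Neither number changes depending on whether anyone opened an incident, which is the point: opening one can't make the metric look worse.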

It's hard enough to convince junior folks to open incidents without them feeling like they will mess up a very-important-management-eyes-on-it metric if they do...

1 comment

I get the point of choosing key metrics that are indicative of the reliability or quality of service. Having to narrow it down to a headline makes it less clear. For a datacentre region, counting the number of outages may very well be a great bottom-line indication, with more fine-grained objectives being tracked alongside it.