by pthomas551 2228 days ago
SLAs and maybe even company politics. Incidents are politicized at a lot of companies. Even if the official rhetoric says "when it comes to incidents we don't blame people and there are no politics" the reality is often the complete opposite.
3 comments

It's a classic example of Goodhart's Law.

Setting a goal of "Zero or Near" down times doesn't necessarily mean fewer down times, but it usually does mean politics, shell games and 'clever' bookkeeping by middle and lower management, all of which is clearly visible to the rank and file.

Before, you had down times and customers wondering if they could trust you. After a goal of "Zero or Near", you still have "down times", and you still have customers wondering if they can trust you while they stare at a status page that is clearly lying to them.

But now you have employees watching their managers lying to their bosses, and wondering why they shouldn't do the same. You have a fall in morale, because those pretty numbers usually come at the expense of the things on the "Our Values" page. But to the Suits the graphs look better, and that's the important bit.

This isn't how it always is or always goes, but I've seen it enough.

Perhaps an "outage" label causes the incident to trigger a more widely attended post-mortem meeting and so they are careful not to over-use it.
"Incident" could mean several things.

Outages mean down, badly, and require RFOs. In my experience working at an ISP we only did any sort of "blame-game" if customers demanded an RFO. We still did post-mortems but usually in a more gentle, roundtable way.

This is really hard to understand. The only way to prevent future incidents is to have a culture free of blame.

Your goal is to identify service impacts, do post mortems where no one is afraid to say something went wrong, and to improve processes to prevent future events. You can't do that if there is a culture of blame.