by dpix 2225 days ago
I hope they change the status page from "Incident" to "Outage". Too often lately I have seen status pages (mainly GitHub) reporting incidents when a feature or site is down completely. It seems like their SLAs are influencing how they report uptime...
5 comments

On the flip side, it looks like Slack is back up for me, but their status page is also lagging at the tail end. I highly doubt they would manipulate their uptime status for SLA purposes. It's unlikely that companies that sign SLAs would rely on the status page to determine whether or not the contract was met.
I agree, I don't think it's malicious and Slack did the right thing here.

My comment was more of a dig at GitHub specifically. Recently things haven't been working for me (from my perspective, an outage), and when you go to their status page it just says "Degraded performance". When your performance has degraded to the point of failure, it's time to acknowledge that it is in fact an outage...

SLAs and maybe even company politics. Incidents are politicized at a lot of companies. Even if the official rhetoric says "when it comes to incidents we don't blame people and there are no politics" the reality is often the complete opposite.
It's a classic example of Goodhart's Law.

Setting a goal of "Zero or Near" downtime doesn't necessarily mean less downtime, but it usually does mean politics, shell games and 'clever' bookkeeping by middle and lower management, all of which is clearly visible to the rank and file.

Before, you had downtime and customers wondering if they could trust you. After a goal of "Zero or Near", you still have "downtime", and you still have customers wondering if they can trust you while they stare at a status page that is clearly lying to them.

But now you have employees watching their managers lying to their bosses, and wondering why they shouldn't do the same. You have a fall in morale, because those pretty numbers usually come at the expense of the things on the "Our Values" page. But to the Suits the graphs look better, and that's the important bit.

This isn't how it always is or always goes, but I've seen it enough.

Perhaps an "outage" label causes the incident to trigger a more widely attended post-mortem meeting and so they are careful not to over-use it.
"Incident" could mean several things.

Outages mean down, badly, and require RFOs. In my experience working at an ISP we only did any sort of "blame-game" if customers demanded an RFO. We still did post-mortems but usually in a more gentle, roundtable way.

This is really hard to understand. The only way to prevent future incidents is to have a culture free of blame.

Your goal is to identify service impacts, do post-mortems where no one is afraid to say something went wrong, and improve processes to prevent future events. You can't do that if there is a culture of blame.

What I've noticed Slack doing is that a couple weeks after an "Incident", it's completely gone from the history when you go back and look at past incidents on that day. Keep an eye on it.
Looks like it's now updated!
Glad they have an SLA, but most of us are free users of Slack. Beggars can't be choosers, right?