by naikrovek 1379 days ago
that's the whole point of a manually updated status page. you don't want automation to update it because that automation can fail. automation likely caused the outage you want to know more about.

you also don't want your automation guessing at what the problem is, or what the effects are. you want real info from a real person even if it isn't given to you the millisecond you look for it.

this is why status pages aren't updated by automation. if they're updated by a person, you know that people know about the problem, you know that people are working on the problem, and so on, which is good, but while they figure out what's going on, you see a "green" status page.

this is normal.

(this is for future readers, more than the person I am replying to.)

4 comments

IMO the reason status pages are not updated automatically is legal. SLA and other legal contracts might change if every time something is down the status page reflects that accurately, so people try to hide it.

Approached that way, a status page is almost useless: it is not reliable, and it only gets updated after I have already found out about the problem from other sources.

I am perfectly happy with a status page that shows the, mm, status of the service. Could be as easy as not reachable, slower than usual or any generic information (a traffic light). I disagree that a status page has to show the why of the error, although of course it would be nice.

> IMO the reason status pages are not updated automatically is legal. SLA and other legal contracts might change if every time something is down the status page reflects that accurately, so people try to hide it.

you are right about legal reasons; some companies count SLA by the time and date stamps on the status page.
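To make the SLA point concrete, here is a minimal sketch of how uptime could be computed purely from the start/end timestamps posted on a status page. The function name, the 30-day period, and the incident times are all made up for illustration; real SLA contracts define their own measurement rules.

```python
from datetime import datetime, timedelta

def uptime_percent(incidents, period_start, period_end):
    """Percentage of the period not covered by any posted incident.

    incidents is a list of (start, end) datetimes taken from the status page.
    """
    period = (period_end - period_start).total_seconds()
    # Clip each incident to the period and sort by start time.
    clipped = sorted(
        (max(start, period_start), min(end, period_end))
        for start, end in incidents
        if end > period_start and start < period_end
    )
    # Merge overlapping incidents so shared time is not counted twice.
    down = 0.0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                down += (cur_end - cur_start).total_seconds()
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        down += (cur_end - cur_start).total_seconds()
    return 100.0 * (1 - down / period)

# Hypothetical month: the outage really lasted 60 minutes,
# but the incident was posted on the status page 20 minutes late.
month_start = datetime(2021, 1, 1)
month_end = month_start + timedelta(days=30)
real = [(datetime(2021, 1, 10, 10, 0), datetime(2021, 1, 10, 11, 0))]
posted = [(datetime(2021, 1, 10, 10, 20), datetime(2021, 1, 10, 11, 0))]
print(uptime_percent(real, month_start, month_end))
print(uptime_percent(posted, month_start, month_end))
```

Under these assumptions, the later the incident gets posted, the better the measured uptime looks, which is why the timestamps matter contractually.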

people hiding a real outage when users know damned well there is an outage is thankfully not common at all.

if you can design and run a 100% reliable status page which never reports incorrect information, while also reporting useful information, you will be a hero to many.

> people hiding a real outage when users know damned well there is an outage is thankfully not common at all.

Thankfully people are not hiding it as in a conspiracy to pretend nothing is wrong. But, as you see in many comments in this thread, status pages rarely reflect that something is down immediately (because they are updated manually by humans).

This delay, codified in processes, is very convenient, and to me this is purposely hiding that a service is down. People are not hiding it, but the processes that control the status page are, indeed, hiding this information. This makes status pages less useful, IMO.

Yet Discord has an “API Response Time” graph[0] and Reddit has a “5xx error rate” graph[1]. No, it doesn’t automatically create incidents, but it’s nice to confirm an issue is happening site-wide after experiencing it.

Actually, it looks like the metrics part of Reddit’s status page broke over 2 weeks ago.

0: https://discordstatus.com/

1: https://www.redditstatus.com/
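For anyone curious what a "5xx error rate" graph boils down to, here is a rough sketch that buckets requests by minute and computes the share of 5xx responses in each bucket. The input shape (pairs of Unix timestamp and HTTP status) is an assumption for illustration; this is not how Discord or Reddit actually build their graphs.

```python
from collections import defaultdict

def error_rate_by_minute(requests):
    """requests: iterable of (unix_seconds, http_status).

    Returns {minute_start_unix: fraction_of_5xx_responses}, ordered by minute.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for ts, status in requests:
        minute = int(ts) // 60 * 60  # round down to the start of the minute
        totals[minute] += 1
        if 500 <= status <= 599:
            errors[minute] += 1
    return {m: errors[m] / totals[m] for m in sorted(totals)}

# Made-up sample: one quiet minute, then a minute of failures.
sample = [(0, 200), (10, 500), (59, 200), (60, 503), (61, 502)]
print(error_rate_by_minute(sample))
```

Plotting those per-minute values over time gives the kind of graph that lets a user confirm an issue is site-wide.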

GitHub has such a graph too, but limited to internal employees only. In fact that’s how they detect failures with their deployments, elevated 500 statuses. Some individual teams will have their own dashboard, but some do not (like k8s upgrades) and they only monitor 500s.

Rest assured, someone is looking into this problem right now.

> Yet Discord has an “API Response Time” graph[0] and Reddit has a “5xx error rate” graph[1].

that's awesome. doesn't exist for github, yet. would be nice if it does come.

I'm going to disagree here. The point of a manually updated status page is appearances.

With proper reporting it's trivial to know which subsystem is experiencing problems, if any. It doesn't have to be very granular, just "normal", "experiencing issues", "offline". If reporting doesn't work, you should be alerted it doesn't work, and if alerting doesn't work, there needs to either be out-of-band alerting for that or someone monitoring the status at all times.

Manual overrides for status pages should exist for when the automation doesn't work of course.
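As a rough sketch of the scheme described above: map an error rate to one of the three coarse states, fall back to a separate state when the reporting itself has gone stale, and let a human override everything. The thresholds, the 120-second staleness limit, and the extra "unknown" state are my own assumptions, not anything a particular vendor does.

```python
def classify(error_rate, degraded_at=0.01, offline_at=0.5):
    """Map an error rate (0.0 to 1.0) to a coarse public status."""
    if error_rate >= offline_at:
        return "offline"
    if error_rate >= degraded_at:
        return "experiencing issues"
    return "normal"

def public_status(sample, now, max_age_s=120, override=None):
    """Decide what the status page shows.

    sample:   (unix_timestamp, error_rate) from monitoring, or None
    now:      current unix timestamp
    override: status string set manually by a human, or None
    """
    if override is not None:
        return override  # a person always wins over the automation
    if sample is None or now - sample[0] > max_age_s:
        # Reporting is missing or stale: do not guess. This is also the
        # point where out-of-band alerting should page someone.
        return "unknown"
    return classify(sample[1])

print(public_status((1000, 0.002), now=1030))
print(public_status((1000, 0.08), now=1030))
print(public_status((1000, 0.08), now=2000))
print(public_status((1000, 0.08), now=1030, override="offline"))
```

The design choice worth arguing about is the stale case: showing "unknown" is more honest than showing green, but it needs its own alert so that someone notices.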

At my last job we had a big screen in the office that we monitored (Grafana), and we usually saw problems before the alerting kicked in, since it had about a minute of delay. Outside the office or outside work hours, the on-call received alerts. It was neither technically nor organisationally complex.

This is so wrong that it makes me wonder if it's satire.

"The whole point" (as you put it) of status pages was to publish high-level monitoring data to users. The monitoring process should occur outside the system that is being monitored, perhaps even on a different cloud.
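A minimal sketch of what an outside-in check might look like: a single HTTP GET issued from a machine that does not share infrastructure with the service. The `probe` function, its `opener` parameter, and the "5xx means unhealthy" rule are assumptions for illustration; only `urllib.request.urlopen` and the `urllib.error` exceptions come from the Python standard library.

```python
import time
import urllib.error
import urllib.request

def probe(url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Run one HTTP GET and return (healthy, status_code_or_None, seconds).

    opener is injectable so the function can be exercised without a network.
    """
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as resp:
            code = resp.status
    except urllib.error.HTTPError as err:
        code = err.code  # server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        code = None      # DNS failure, refused connection, timeout, ...
    elapsed = time.monotonic() - started
    # Assumption: anything below 500 counts as "the service is answering".
    healthy = code is not None and code < 500
    return healthy, code, elapsed
```

Run on a schedule from somewhere independent of the monitored system, the results of a check like this could feed a public graph even while the service's own infrastructure is down.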

Eventually, many companies realized this revealed expensive SLA violations and ended that level of transparency.

Your status page can and should report important metrics to users, like elevated error rates. Most status pages used to.

not satire.

no company will put any amount of monitoring online for anyone to see, no matter how high level. for it to be useful info, it must contain details, and information about infrastructure is usually well guarded for very good reasons.

> no company will put any amount of monitoring online for anyone to see, no matter how high level

Many companies used to do this. I remember the first time someone on HN commented, "Hey, is it possible this status page is just a useless blog now?" And people were trying to figure it out.

Companies arguably have a contractual obligation to be transparent about this data with their customers anyway, so a company like GitHub (where such a huge percentage of the industry is a customer) is going to leak the data one way or another.