Hacker News
by knorker 228 days ago
To add to the reasons others gave: It needs to be correct.

Engineers are working the problem. They have a pretty good understanding of the impact of the outage. Then an external comms person asks an engineer to proofread the external outage comms, which triggers rounds of "no, this part is not technically correct" and "I know the internal system scope impact, but not how that maps to the external product names you want to communicate".

Sure, it'd be nice if the message "we are investigating an issue with… uh… some products" came up faster.


I totally get that, but how hard would it be to actually make calls to your own API from the status page? If it fails, display a vague message saying there might be issues and that you are looking into it. Clearly these metrics and alerts exist internally too.
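A minimal sketch of what that suggestion might look like, assuming a single health URL is a fair stand-in for "your own API". The function names and the message wording are made up for illustration; nothing here is any real status page's implementation.

```python
import urllib.request


def probe(url, timeout=5.0):
    """Return True if a GET to `url` answers with a 2xx status in time.

    urlopen raises HTTPError (a subclass of OSError) for 4xx/5xx responses,
    and URLError or a timeout (also OSError subclasses) for network failures.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False


def status_message(probe_ok):
    """Deliberately vague wording: no scope, no root cause (hypothetical text)."""
    if probe_ok:
        return "No known issues."
    return "We may be experiencing issues and are looking into it."


# Usage (hypothetical URL): status_message(probe("https://api.example.com/health"))
```

Note that a failed probe here cannot tell a broken service from a broken network path between the status page and the API.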

I'm not asking for an instant RCA or confirmation of the scope of the outage. Just stop gaslighting me.

> I totally get that, but how hard would it be to actually make calls to your own API from the status page?

Ah, so you're saying the status page should be hooked up to internal monitoring probers?

So how sure are you that it's the service that's broken, and not the probers? How sure are you that the granularity of the probers reflects the actual scope of the outage?

Also, this invites questions like "well why don't you have probing on the EXACT workflow that happened to break this time?!". Honestly, that's not helpful.

Say you have a complete end to end workflow for your web store. Should you publish "100% outage, the webstore is down!!" on your status page, automatically, because the very diligent prober failed to get into the shoe section of your store? That's probably not helpful to anybody.
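To make the web store example concrete, here is a rough sketch of the two ways a page could turn prober results into a headline. The workflow names and both functions are invented for illustration, and even the "scoped" version only knows about workflows that happen to have a prober.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    workflow: str  # hypothetical workflow name, e.g. "browse-shoes"
    ok: bool


def naive_status(results):
    """Any single failed probe flips the whole page to 'outage'."""
    return "outage" if any(not r.ok for r in results) else "operational"


def scoped_status(results):
    """Name the probed workflows that failed instead of claiming a total outage."""
    failed = sorted(r.workflow for r in results if not r.ok)
    if not failed:
        return "operational"
    if len(failed) == len(results):
        return "outage"
    return "degraded: " + ", ".join(failed)


results = [
    ProbeResult("login", True),
    ProbeResult("checkout", True),
    ProbeResult("browse-shoes", False),
]
print(naive_status(results))   # outage
print(scoped_status(results))  # degraded: browse-shoes
```

Neither answer says how many real users are affected, which is the part a status page reader actually cares about.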

> Clearly these metrics and alerts exist internally too.

Well, no. Probers can never cover every dimension across which a service can have an outage. You may think that the service is simple and has an obvious status, but you're using like 0.1% of the user surface, and have never even heard of the weird things that 99% of actual traffic does.

How do you even model your minority use case? Is it an outage? Or is your workflow maybe a tiny weird one, even though you think it's the straightforward one?

Especially since outages in complex systems tend to be complex to describe accurately, and a status page needs to boil them down to something simple.

In many cases even engineers inspecting the system cannot be certain whether real users are experiencing an outage, or if they're chasing an internal user, or if nothing is user visible because internal retries are taking care of everything, or what.

Complex systems are often complex because the world is complex. And if the problem were simple and unchanging, there would be no reason to have outages in the first place.

And often engineers helping phrase an outage statement need to give up detail for the sake of clarity.

Another thing is what do you do if you start serving 500s to 90% of traffic? An outage, right? Surely auto-publish to a status page? Oh, but it turns out this was a DoS attack, and no non-DoS traffic was affected. Can your monitoring detect the difference? Unlikely.
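A toy illustration of that 90% case, with made-up numbers. The `client` label is an assumption added purely for the example: it is exactly the piece of information real monitoring usually does not have at alert time.

```python
def error_ratio(requests):
    """Fraction of requests answered with a 5xx status."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r["status"] >= 500)
    return errors / len(requests)


# Hypothetical traffic: 90 attack requests that all get a 500,
# and 10 legitimate requests that all succeed.
requests = (
    [{"client": "attacker", "status": 500}] * 90
    + [{"client": "customer", "status": 200}] * 10
)

overall = error_ratio(requests)
legit = error_ratio([r for r in requests if r["client"] != "attacker"])

print(overall)  # 0.9
print(legit)    # 0.0
```

An auto-publisher keyed on `overall` would announce a major outage, while the number that matters to customers is `legit`, and telling the two apart is the hard part.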

Automation is always significantly easier said than done. Furthermore, as has been emphasized elsewhere in this thread, there is a critical need for a human in the loop.