Hacker News new | ask | show | jobs
by phyzome 890 days ago
As an engineer on call, I have been in this conversation so many times:

"Hey, should we go red?" "I don't know, are we sure it's an outage, or just a metrics issue?" "How many users are affected again?" "I can check, but I'm trying to read stack traces right now." "Look, can we just report the issue?" "Not sure which services to list in the outage"

...and so on. Basically, putting anything up on the status page is a conversation, and the conversation consumes engineer time and attention, and that's more time before the incident is resolved. You have to balance communication and actually fixing the damn thing, and it's not always clear what the right balance is.

If you have enough people, you can have a Technical Incident Manager handle the comms and you can throw additional engineers at the communications side of it, but that's not always possible. (Some systems are niche, underdocumented, underinstrumented, etc.)

My personal preference? Throw up a big vague "we're investigating a possible problem" at the first sign of trouble, and then fill in details (or retract it) at leisure. But none of the companies I've worked at like that idea, so... [shrug]

5 comments

This is exactly why those status pages are almost always a lie. Either they need to be fully automated without some middle manager hemming and hawing, or they shouldn’t be there at all. From a customer’s perspective, I’ve been burned so many times on those status pages that I ignore them completely. I just assume they’re a lie. So I’ll contact support straight away - the very thing these status pages were intended to mitigate.
The simple fix is to have a “last update: date time”

Or you can build a team to automate everything and force everyone and everything into a rigid update frequency which becomes a metric that applies to everyone and becomes the bane of the existence of your whole engineering organization

meh - no status page is perfectly in sync with reality, even if it's updated automatically. There's always lag and IRL, there's often partial outages.

Therefore, one should treat status pages conservatively as "definitely an outage" rather than "maybe an outage."

I think your bit at the end is the most important.

ANY communication is better than no communication "everything is fine, it must be you" is the worst feeling in these cases. Especially if your business is reliant on said service and you can't figure out why you are borked (eg the github ones).

Your point highlights thinking about what's being designed.

everything is fine is different from nothing has been reported. A green is misleading, there should be no green as green is unknown, there should be nothing with a note that there's nothing, and that's not the same as a green light.

Once an ISP support person insisted that I drive down to the shop and buy a phone handset so I could confirm presence of a dial tone on a line that my vdsl modem had line sync on before they’d tell me their upstream provider had an outage. I was… unimpressed.
> ANY communication is better than no communication

Better for the consumer, although not necessarily better for the provider if they have an SLA.

IMHO, any significant growth in 500s (that's what I was getting during the outage) warrants mention on status page. I've seen a lot of stuff, so if I see an acknowledged outage, I'll just wait for people to do their jobs. Stuff happens. If I see unacknowledged one, I get worried that people who need to know don't and that undermines my confidence in the whole setup. I'd never complain if status page says maybe there's a problem but I don't see one. I will complain in the opposite case.
And that is before 'going red' has ties to performance metrics with SLA impacts ...
Which then means not going yellow or red technically constitutes fraud.
Not necessarily. The situation can be genuinely unclear to the point where it is a judgement call, and then it becomes a matter of how to weigh the consequences.
If you're asking how many users are affected, and your service is listed as green...
What if the answer is 0.00001%?
Still seems like a yellow to me.
Connect your status page to actual metrics and decide a treshold for downtime. Boom you’re done.
Stage 1: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.

Problems: Delayed or missed updates. Customers complain that you're not being honest about outages.

Stage 2: Status is automatically set based on the outcome of some monitoring check or functional test.

Problems: Any issue with the system that performs the "up or not?" source of truth test can result in a status change regardless of whether an actual problem exists. "Override automatic status updates" becomes one of the first steps performed during incident response, turning this into "status is manually set, but with extra steps". Customers complain that you're not being honest about outages and latency still sucks.

Stage 3: Status is automatically set based on a consensus of results from tests run from multiple points scattered across the public internet.

Problems: You now have a network of remote nodes to maintain yourself or pay someone else to maintain. The more reliable you want this monitoring to be, the more you need to spend. The cost justification discussions in an enterprise get harder as that cost rises. Meanwhile, many customers continue to say you're not being honest because they can't tell the difference between a local issue and an actual outage. Some customers might notice better alignment between the status page and their experience, but they're content, so they have little motivation to reach out and thank you for the honesty.

Eventually, the monitoring service gets axed because we can just manually update the status page after all.

Stage 4: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.

Does anyone serious do this?

That’s an honest question, from a pretty experienced SRE.

In a world of unicorns and rainbows, absolutely. In the real world, it's as you probably already know: it's not that easy in a complex enough system.

Quick counter-example for GP: what if the 500 spike is due to a spike in malformed requests from a single (maybe malicious) user?

A malformed request should not lead to a 500, they should be handled and validated.
Well, in the real world it might. It should trigger a bug creation and a fix to the code, but not an incident. Now all of a sudden to decide this you need more complex and/or specific queries in your monitoring system (or a good ML-based alert system), so complexity is already going up.
Query input validation is nearly a solved problem. If you don't I would argue this is an incident if in this case 500's are returned.
You need to validate your inputs and return 4xx
True, however it also doesn’t impact other users and doesn’t justify reporting an incident on the status page.