|
As an engineer on call, I have been in this conversation so many times: "Hey, should we go red?"
"I don't know, are we sure it's an outage, or just a metrics issue?"
"How many users are affected again?"
"I can check, but I'm trying to read stack traces right now."
"Look, can we just report the issue?"
"Not sure which services to list in the outage." ...and so on. Basically, putting anything up on the status page is a conversation, and that conversation consumes engineer time and attention, which means more time before the incident is resolved. You have to balance communicating against actually fixing the damn thing, and it's not always clear what the right balance is. If you have enough people, you can have a Technical Incident Manager handle the comms and throw additional engineers at the communications side, but that's not always possible. (Some systems are niche, underdocumented, underinstrumented, etc.) My personal preference? Throw up a big, vague "we're investigating a possible problem" notice at the first sign of trouble, and then fill in details (or retract it) at leisure. But none of the companies I've worked at have liked that idea, so... [shrug] |