| > I wish we would just throw up a generic "Shit's Fucked Up. We Don't Know Why Yet, But We're Working On It" message. I gotta say, the implication that you can't register an outage until you know why it happened is pretty damning. The status page is where we look to see if services are affected; if that information can't be shared there until you understand the cause, that's very broken. The AWS status page has become kind of a joke to customers. I was encouraged to see the announcement in OP say that there is "a new version of our Service Health Dashboard" coming. I hope it can provide actual capabilities to display, well, service health. From how people talk about it, it kind of sounds like updates to the Service Health Dashboard are currently a purely manual process, rather than automated monitoring updating the Service Health Dashboard in any way at all. I find that a surprising implementation for an organization of Amazon's competence and power. That alarms me more than who it is that has the power to manually update it; I agree that I don't have enough knowledge of AWS internal org structures to have an opinion on whether it's the "right" people or not. I suspect AWS must have internal service health pages that are actually automatically updated in some way by monitoring, that is, that actually work to display service health. It seems like a business decision rather than a technical challenge if the public-facing system has no inputs but manual human entry, but that's just how it seems from the outside; I may not have full information. We only have what Amazon shares with us, of course. |
I get that it not being updated is an annoyance, but I cannot figure out why it is the single most discussed thing about this whole event. I mean, entire services were out for almost an entire day, and if you read HN threads it would seem that nobody even cares about lost revenue/productivity, downtime, etc. The vast majority of comments in all of the outage threads are screaming about how the SHD lied.
In my entire career of consulting across many companies and many different technology platforms, never once have I seen or heard of anyone even looking at a status page outside of HN. I'm not exaggerating. Even over the last 5 years when I've been doing cloud consulting, nobody I've worked with has cared at all about the cloud provider's status pages. The only time I see it brought up is on HN, and when it gets brought up on HN it's discussed with more fervor than most other topics, even the outage itself.
In my real-life (non-HN) experience, when an outage happens, teams ask each other "hey, you seeing problems with this service?" "yeah, I am too, heard maybe it's an outage" "weird, guess I'll try again later" and go get a coffee. In particularly bad situations, they might check the news or ask me if I'm aware of any outage. Either way, we just... go on with our lives? I've never needed, nor have I ever seen people need, a status page to inform them that things aren't working correctly, but if you read HN you would get the impression that entire companies of developers are completely paralyzed unless the status page flips from green to red. Why? I would even go so far as to say that if you need a third party's SHD to tell you whether things are working right, then you're probably doing something wrong.
Seriously, what gives? Is all this just because people love hating on Amazon and the SHD is an easy target? Because that's what it seems like.