by Schweigi 2038 days ago
Indeed, we saw the first AWS Kinesis issues as early as 13:50 (UTC), and they're still ongoing two hours later. The status page didn't even update in the first 45 min or so...

That's typical. The AWS status page is a marketing gimmick whose job is to stay green, not a good faith attempt to assess and report status. If there's an outage, seeing it accurately reflected on the status page is the exception, not the rule.
Isn't that fraud?

edit: not sure why my question deserved a downvote...

If you're small, yes; if you're AWS, it's business as usual?
Updating the status dashboard is pretty low priority for operators trying to resolve this issue. It requires escalation up the management chain and careful wording.
>for operators trying to resolve this issue

It's a shame Amazon doesn't have thousands of employees to divide these tasks between different people, as it is only these busy operators who could update this status page.

If you're right, why have the status page then? It is useless by your definition, yes?

Not to mention it doesn't take a technical person to update the status page.

It's even more frustrating when you are aware of problems early on and start talking to support and THEY don't even know about the problems yet.

Maybe the thousands of people are what prevents the status from being updated; everyone tries to hide their own faults, even internally.

Heck, doesn't Amazon have an AI/ML product? Make the status page reflect sentiment analysis of support conversations.
Literally 99.9% of the employees have no more knowledge than you about the inner workings of a given AWS service. This isn't to forgive their lack of updating the status page, but large engineering orgs are never the knowledge monoliths you might imagine they are.
This isn't a question of knowing a service is down. We're assuming the team that is fixing the service knows it is down. It was a question of not having the resources to direct literally any other person in the org to log into an admin panel and flip a toggle from green to red.
I was merely addressing the "thousands of employees" non sequitur. Org structure means that the raw number of employees is a meaningless metric. The only people who are going to potentially flip that switch are going to have some sort of direct responsibility for the product. That number is going to be very similar whether it's a large company like Amazon or a smaller one like, say, Heroku or Dreamhost.
Just because it has a lag from “issues reported” to “confirmed outage” doesn’t mean it’s useless. Non-green means there are issues and Amazon is aware of them.
My comment was in the context of the assertion that a team that knows the service is down and is fixing it is too busy to update the status, therefore no one else can update this status. Certainly it is understandable that if the issue is unknown that status cannot be updated.
> Updating the status dashboard is pretty low priority for operators trying to resolve this issue

Which is why, during incident responses, there have to be people in charge of communication, both internal and external, and some of this can be further delegated.

That's a poor excuse.

> It requires escalation up the management chain and careful wording

Careful wording is more important for external stakeholders who might not have the full context. If one is walking on eggshells with internal management too, that's bad management. Incident communication should be factual and concise.

> Incident communication should be factual and concise.

Could not agree more. It's immensely frustrating working with organisations that spend more time trying to cover up the cause of an outage to external stakeholders than actually fixing the root cause.

The same organisations tend to try and blame individuals for outages.

I think both are a symptom of businesses that embrace the "blame culture".

By design. If it was a good faith attempt to report status, it would be automatically updated from a flock of canaries instead of through a slow, political process.
Even that would be meaningless at the scale of AWS.

"A top of rack switch let out the blue smoke and it'll be ~30 before we can re-rack it" would impact what fraction of a fraction of a percent of canaries? Irrelevant to me, unless of course my VM lives on a box backed by that switch. ;)

The status dashboard exists for us to laugh at when things break and to convince C*Os that everything is fine. That's it.

Ehhh... the ratio of "bump in the night" problems that affect just me to genuine outages that cross regions and affect others is about 1:1, and then about 1 in 4 or 5 of the cross-region problems blow up to the scale where they feel forced to update the dashboard. So I disagree, I think a canary flock would be both meaningful and useful.

As you point out, though, the status dashboard isn't truly meant to be either of those things. I don't have any illusions about it ever changing.
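
For what it's worth, the canary-flock idea argued over above is easy to sketch. This is only an illustration, not how AWS builds its dashboard: the `CanaryResult` type, the `service_status` function, and both thresholds are made up for the example. It rolls up probe results per (service, region) and picks a color from the failure rate, so one dead top-of-rack switch stays green while a widespread failure turns red without anyone having to sign off on it.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class CanaryResult:
    """One probe outcome from one canary (hypothetical shape)."""
    service: str
    region: str
    ok: bool


def service_status(
    results: Iterable[CanaryResult],
    degraded_at: float = 0.05,  # assumed threshold, not an AWS value
    outage_at: float = 0.50,    # assumed threshold, not an AWS value
) -> Dict[Tuple[str, str], str]:
    """Map each (service, region) to 'green', 'yellow', or 'red'
    based on the fraction of its canaries that failed."""
    totals: Dict[Tuple[str, str], int] = {}
    failures: Dict[Tuple[str, str], int] = {}
    for r in results:
        key = (r.service, r.region)
        totals[key] = totals.get(key, 0) + 1
        if not r.ok:
            failures[key] = failures.get(key, 0) + 1

    status: Dict[Tuple[str, str], str] = {}
    for key, total in totals.items():
        rate = failures.get(key, 0) / total
        if rate >= outage_at:
            status[key] = "red"
        elif rate >= degraded_at:
            status[key] = "yellow"
        else:
            status[key] = "green"
    return status
```

With thresholds like these, 1 failing canary out of 1000 leaves a service green, which is the "fraction of a fraction of a percent" case; whether that is the right call for the one customer behind that switch is exactly the disagreement in this thread.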

As of this moment, there are more non-green services than I've ever seen. And it's steadily getting worse.

EDIT: 15 minutes later and the board is looking worse again.