by rcardo11 2038 days ago
> This is also causing issues with Amplify, API Gateway, AppStream2, AppSync, Athena, Cloudformation, Cloudtrail, Cloudwatch, Cognito, DynamoDB, IoT Services, Lambda, LEX, Managed BlockChain, S3, Sagemaker, and Workspaces.

Well, this is a major outage.

9 comments

Indeed, we had the first AWS Kinesis issues already at 13:50 (UTC). Now it's still ongoing after two hours. The status page didn't even update in the first 45 min or so...
That's typical. The AWS status page is a marketing gimmick whose job is to stay green, not a good faith attempt to assess and report status. If there's an outage, seeing it accurately reflected on the status page is the exception, not the rule.
Isn't that fraud?

edit: not sure why my question deserved a downvote...

If you're small, yes, if you're AWS, it's business as usual?
Updating the status dashboard is pretty low priority for operators trying to resolve this issue. It requires escalation up the management chain and careful wording.
>for operators trying to resolve this issue

It's a shame Amazon doesn't have thousands of employees to divide these tasks between different people, as it is only these busy operators who could update this status page.

If you're right, why have the status page then? It is useless by your definition, yes?

Not to mention it doesn't take a technical person to update the status page.

It's even more frustrating when you are aware of problems early on and start talking to support and THEY don't even know about the problems yet.

Maybe the thousands of people are what prevents the status from being updated: everyone tries to hide their own faults, even internally.

Heck, doesn't Amazon have an AI/ML product? Make the status page reflect sentiment analysis of support conversations.
Literally 99.9% of the employees have no more knowledge than you about the inner workings of a given AWS service. This isn't to forgive their lack of updating the status page, but large engineering orgs are never the knowledge monoliths you might imagine they are.
This isn't a question of knowing a service is down. We're assuming the team that is fixing the service, knows it is down. It was a question of not having the resources to direct literally any other person in the org to log into an admin panel and flip a toggle from green to red.
Just because it has a lag from “issues reported” to “confirmed outage” doesn’t mean it’s useless. Non-green means there are issues and Amazon is aware of them.
My comment was in the context of the assertion that a team that knows the service is down and is fixing it is too busy to update the status, therefore no one else can update this status. Certainly it is understandable that if the issue is unknown that status cannot be updated.
> Updating the status dashboard is pretty low priority for operators trying to resolve this issue

Which is why, during incident responses, there have to be people in charge of communication. Both internal and external communication, and some of this can be further delegated.

That's a poor excuse.

> It requires escalation up the management chain and careful wording

Careful wording is more important for external stakeholders who might not have the full context. If one is walking on eggshells with internal management too, that's bad management. Incident communication should be factual and concise.

> Incident communication should be factual and concise.

Could not agree more. It's immensely frustrating working with organisations that spend more time trying to cover up the cause of an outage to external stakeholders than actually fixing the root cause.

The same organisations tend to try to blame individuals for outages.

I think both are a symptom of businesses that embrace the "blame culture".

By design. If it was a good faith attempt to report status, it would be automatically updated from a flock of canaries instead of through a slow, political process.
Even that would be meaningless at the scale of AWS.

"A top of rack switch let out the blue smoke and it'll be ~30 before we can re-rack it" would impact what fraction of a fraction of a percent of canaries? Irrelevant to me, unless of course my VM lives on a box backed by that switch. ;)

The status dashboard exists for us to laugh at when things break and to convince C*Os that everything is fine. That's it.

Ehhh... the ratio of "bump in the night" problems that affect just me to genuine outages that cross regions and affect others is about 1:1, and then about 1 in 4 or 5 of the cross-region problems blow up to the scale where they feel forced to update the dashboard. So I disagree, I think a canary flock would be both meaningful and useful.

As you point out, though, the status dashboard isn't truly meant to be either of those things. I don't have any illusions about it ever changing.
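For what it's worth, the "flock of canaries" idea can be sketched in a few lines. This is only an illustration under stated assumptions, not how AWS computes its dashboard: `CanaryResult`, `service_status`, and the two thresholds are hypothetical names and values picked for the example.

```python
from dataclasses import dataclass


@dataclass
class CanaryResult:
    service: str
    region: str
    ok: bool  # did this canary's probe succeed?


def service_status(results, degraded_at=0.05, outage_at=0.5):
    """Map canary pass/fail results to a colour per (service, region).

    The thresholds are made-up defaults, not values from any real dashboard.
    """
    totals = {}
    for r in results:
        key = (r.service, r.region)
        passed, failed = totals.get(key, (0, 0))
        totals[key] = (passed + r.ok, failed + (not r.ok))

    status = {}
    for key, (passed, failed) in totals.items():
        failure_rate = failed / (passed + failed)
        if failure_rate >= outage_at:
            status[key] = "red"
        elif failure_rate >= degraded_at:
            status[key] = "yellow"
        else:
            status[key] = "green"
    return status
```

With thresholds like these, a single dead top-of-rack switch stays green on the aggregate view, which is the objection raised above; any canary-driven page has to pick where to draw that line.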

As of this moment, there are more non-green services than I've ever seen. And it's steadily getting worse.

EDIT: 15 minutes later and the board is looking worse again.

Thanks for this. My Lambda@Edge function was not working and I thought I broke something in my permissions even though I had not touched them for at least a month. This is the very "helpful" error message:

The Lambda function associated with the CloudFront distribution is invalid or doesn't have the required permissions. We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner. If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.

This is also affecting Fargate (at least EKS) in that its scheduling system is broken. No way to get new pods.
Fargate console is reporting no capacity in us-east-1, which is a bummer because I've lost several services that got spun down, apparently due to missing CloudWatch data. But EC2 appears to be working, though it's taking noticeably longer to create resources. I think the take-away for a lot of people is that multiple availability zones are not a substitute for proper BCP that encompasses multiple regions or cloud providers.
Same story in ECS. Seems like virtually anything on Fargate can't spawn new instances.
We're also seeing issues with Fargate ECS -- the task we had with auto-scaling scaled down to 0. The one we had with a fixed number of workers is fine.
Thanks for mentioning this - that's a nasty failure mode.
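A rough sketch of how a scaler could avoid that failure mode, assuming the scale-down really was triggered by missing metric data (the thread only says "apparently"). `desired_count` and its parameters are hypothetical, not an AWS API; the idea is just to hold the current size when most datapoints are absent instead of reading silence as zero load.

```python
import math
from typing import Optional, Sequence


def desired_count(current: int,
                  samples: Sequence[Optional[float]],
                  target: float,
                  min_count: int = 1,
                  max_count: int = 20) -> int:
    """Target-tracking style decision that refuses to act on missing data.

    `samples` are recent metric datapoints; None marks a missing one.
    """
    present = [s for s in samples if s is not None]
    # Hypothetical rule: if more than half the datapoints are missing
    # (or there are none at all), keep the current size.
    if not present or len(present) * 2 < len(samples):
        return current
    avg = sum(present) / len(present)
    proposed = math.ceil(current * avg / target)
    # Clamp so the service never scales to zero or beyond the ceiling.
    return max(min_count, min(max_count, proposed))
```

The `min_count` floor is the same protection the fixed-size task above got for free.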
It's always a DNS issue.
That's a tough one -- I'm usually with you that it's always either DNS or cert expiry, but my go-to "it's always ..." when discussing AWS is: it's always security groups

Heh, maybe they accidentally locked themselves out of IAM, since those are great fun to troubleshoot, also

I'm also seeing weirdness with Batch. It's working, but the dashboards aren't showing job statuses accurately and jobs aren't always terminating.
and this is what's disclosed to the public
seeing issues with scaling up/down in elastic beanstalk too
yep, iot in us-east-1 not working for me