| HN Mirror


	by rcardo11 2038 days ago
	> This is also causing issues with Amplify, API Gateway, AppStream2, AppSync, Athena, Cloudformation, Cloudtrail, Cloudwatch, Cognito, DynamoDB, IoT Services, Lambda, LEX, Managed BlockChain, S3, Sagemaker, and Workspaces. Well, this is a major outage.

9 comments

Schweigi 2038 days ago

Indeed, we had the first AWS Kinesis issues already at 13:50 (UTC). Now it's still ongoing after two hours. The status page didn't even update in the first 45 min or so...

jjoonathan 2037 days ago

That's typical. The AWS status page is a marketing gimmick whose job is to stay green, not a good faith attempt to assess and report status. If there's an outage, seeing it accurately reflected on the status page is the exception, not the rule.

simlevesque 2037 days ago

Isn't that fraud?

edit: not sure why my question deserved a downvote...

WrtCdEvrydy 2037 days ago

If you're small, yes, if you're AWS, it's business as usual?

chizhik-pyzhik 2037 days ago

Updating the status dashboard is pretty low priority for operators trying to resolve this issue. It requires escalation up the management chain and careful wording.

PKop 2037 days ago

>for operators trying to resolve this issue

It's a shame Amazon doesn't have thousands of employees to divide these tasks between different people, as it is only these busy operators who could update this status page.

If you're right, why have the status page then? It is useless by your definition, yes?

chrisan 2037 days ago

Not to mention it doesn't take a technical person to update the status page.

It's even more frustrating when you are aware of problems early on and start talking to support and THEY don't even know about the problems yet.

Maybe having thousands of people is what prevents the status from being updated; everyone tries to hide their own faults, even internally.

rhizome 2037 days ago

Heck, doesn't Amazon have an AI/ML product? Make the status page reflect sentiment analysis of support conversations.

mypalmike 2037 days ago

Literally 99.9% of the employees have no more knowledge than you about the inner workings of a given AWS service. This isn't to forgive their lack of updating the status page, but large engineering orgs are never the knowledge monoliths you might imagine they are.

PKop 2037 days ago

This isn't a question of knowing a service is down. We're assuming the team that is fixing the service knows it is down. It was a question of not having the resources to direct literally any other person in the org to log into an admin panel and flip a toggle from green to red.

Spivak 2037 days ago

Just because it has a lag from “issues reported” to “confirmed outage” doesn’t mean it’s useless. Non-green means there are issues and Amazon is aware of them.

PKop 2037 days ago

My comment was in the context of the assertion that a team that knows the service is down and is fixing it is too busy to update the status, and therefore no one else can update it. Certainly it is understandable that if the issue is unknown, the status cannot be updated.

outworlder 2037 days ago

> Updating the status dashboard is pretty low priority for operators trying to resolve this issue

Which is why, during incident response, there have to be people in charge of communication. Both internal and external communication, and some of this can be further delegated.

That's a poor excuse.

> It requires escalation up the management chain and careful wording

Careful wording is more important for external stakeholders who might not have the full context. If one is walking on eggshells with internal management too, that's bad management. Incident communication should be factual and concise.

unclekev 2037 days ago

> Incident communication should be factual and concise.

Could not agree more. It's immensely frustrating working with organisations that spend more time trying to cover up the cause of an outage to external stakeholders than actually fixing the root cause.

The same organisations tend to try to blame individuals for outages.

I think both are symptoms of businesses that embrace the "blame culture".

jjoonathan 2037 days ago

By design. If it was a good faith attempt to report status, it would be automatically updated from a flock of canaries instead of through a slow, political process.

35fbe7d3d5b9 2037 days ago

Even that would be meaningless at the scale of AWS.

"A top of rack switch let out the blue smoke and it'll be ~30 before we can re-rack it" would impact what fraction of a fraction of a percent of canaries? Irrelevant to me, unless of course my VM lives on a box backed by that switch. ;)

The status dashboard exists for us to laugh at when things break and to convince C*Os that everything is fine. That's it.

jjoonathan 2037 days ago

Ehhh... the ratio of "bump in the night" problems that affect just me to genuine outages that cross regions and affect others is about 1:1, and then about 1 in 4 or 5 of the cross-region problems blow up to the scale where they feel forced to update the dashboard. So I disagree, I think a canary flock would be both meaningful and useful.

As you point out, though, the status dashboard isn't truly meant to be either of those things. I don't have any illusions about it ever changing.
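
For illustration only, the canary-flock idea from this subthread could look roughly like the sketch below. Nothing here reflects how the AWS status page actually works: `CanaryResult`, `service_status`, and the 5% / 50% thresholds are all made up for the example. The thresholds are where the "one top-of-rack switch" objection would get tuned out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryResult:
    """One probe outcome from a hypothetical canary."""
    region: str
    service: str
    ok: bool


def service_status(results, degraded_at=0.05, outage_at=0.5):
    """Map the failing fraction of canaries to a status color.

    The thresholds are assumptions: a handful of failing canaries
    stays green, a noticeable fraction turns yellow, and a majority
    turns red. No human sign-off is involved.
    """
    if not results:
        # No data at all is its own state, not "green".
        return "unknown"
    failing = sum(1 for r in results if not r.ok) / len(results)
    if failing >= outage_at:
        return "red"
    if failing >= degraded_at:
        return "yellow"
    return "green"
```

With 100 canaries, a single failure stays green under these thresholds, 10 failures turn the service yellow, and 60 turn it red.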

tootie 2037 days ago

As of this moment, there are more non-green services than I've ever seen. And it's steadily getting worse.

EDIT: 15 minutes later and the board is looking worse again.

manishsharan 2037 days ago

Thanks for this. My Lambda@Edge function was not working and I thought I broke something in my permissions, even though I had not touched them for at least a month. This is the very "helpful" error message:

The Lambda function associated with the CloudFront distribution is invalid or doesn't have the required permissions. We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner. If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.

booleanbetrayal 2037 days ago

This is also affecting Fargate (at least EKS) in that its scheduling system is broken. No way to get new pods.

zxcvbn4038 2037 days ago

Fargate console is reporting no capacity in us-east-1, which is a bummer because I've lost several services that got spun down, apparently due to missing CloudWatch data. But EC2 appears to be working, though it's taking noticeably longer to create resources. I think the take-away for a lot of people is that multiple availability zones are not a substitute for proper BCP that encompasses multiple regions or cloud providers.

sethhochberg 2037 days ago

Same story in ECS. Seems like virtually anything on Fargate can't spawn new instances.

jpp 2037 days ago

We're also seeing issues with Fargate ECS -- the task we had with auto-scaling scaled down to 0. The one we had with a fixed number of workers is fine.

mcintyre1994 2037 days ago

Thanks for mentioning this - that's a nasty failure mode.
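
A rough sketch of the guard this failure mode suggests, assuming (as the comments above report, not as a confirmed root cause) that the scale-down was triggered by missing metric data. This is not ECS or Application Auto Scaling code; `desired_count`, its parameters, and the 60% target are hypothetical. The idea is to treat "no data" as "hold" instead of "idle", and to never go below a floor.

```python
import math
from typing import Optional


def desired_count(
    current: int,
    cpu_percent: Optional[float],
    min_count: int = 1,
    max_count: int = 10,
    target: float = 60.0,
) -> int:
    """Simplified target-tracking decision for a worker pool.

    cpu_percent is None when the metrics backend returned no
    datapoint, which is different from a real reading of 0.0.
    """
    if cpu_percent is None:
        # Missing data: keep what is running rather than scale in.
        return current
    # Scale proportionally so utilisation moves toward the target.
    proposed = math.ceil(current * cpu_percent / target)
    # Clamp so a low reading can never take the pool below the floor.
    return max(min_count, min(max_count, proposed))
```

For example, `desired_count(4, None)` keeps 4 workers, while `desired_count(4, 15.0)` scales in to 1 but not to 0.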

odiroot 2037 days ago

It's always a DNS issue.

mdaniel 2037 days ago

That's a tough one -- I'm usually with you that it's always either DNS or cert expiry, but my go-to "it's always ..." when discussing AWS is: it's always security groups

Heh, maybe they accidentally locked themselves out of IAM, since those are great fun to troubleshoot, also

bart_spoon 2037 days ago

I'm also seeing weirdness with Batch. It's working, but the dashboards aren't showing job statuses accurately and jobs aren't always terminating.

cpufry 2037 days ago

and this is what's disclosed to the public

durkie 2037 days ago

seeing issues with scaling up/down in elastic beanstalk too

holler 2037 days ago

yep, iot in us-east-1 not working for me
