by david-giesberg 3032 days ago
David from the Atlassian SRE team here. AWS Direct Connect is experiencing an outage in its US East Region (https://status.aws.amazon.com), which is causing connectivity issues for most Atlassian products and services. We're working hard to get everything back up and running. Please check http://status.atlassian.com for the latest updates. We're posting regularly and will continue to provide updates there.
2 comments

Why does Bitbucket depend on AWS DC? Why wouldn't I just connect to it over the internet? Or is part of it in your own DC?
Yeah, this makes no sense to me. I have a pretty heavy workload in AWS (us-east-1), don't use DC AT ALL, and nothing is down for me today except Atlassian Jira/Confluence Cloud (we self-host BB). It's very odd that their 'cloud'-based application relies on DC.
I don't know but my guess would be anything that isn't core storage - we know they run their own SAN on their own hardware because that was the cause of another outage a month or two ago.

At a guess:

- Bitbucket Pipelines

- Webhook workers

- Front-end web servers

- SSH push/pull workers

Basically anything that's elastic to demand. Presumably the cost of AWS storage makes it not worth it for the Bitbucket team.

I'm not really a networking guy, so perhaps this is an obvious question, but why don't you have a failover configuration to send traffic over VPN or the public Internet? I would expect the latency to increase, but things should otherwise still work.

Is it a cost concern, is DC reliable enough that it's just an accepted risk, or is there some other reason?

Hello, I'm Irena from the Networking Engineering team at Atlassian. I have been directly involved with this incident and wanted to provide some answers to the questions. We’ve built our architectures based on the AWS Direct Connect service because it’s the most reliable and scalable solution based on our customer and network needs. The AWS Direct Connect service we use in the US East Region has multiple redundant links (4x 10Gbps) optimized for data throughput requirements and availability, and to our knowledge the AWS Direct Connect transit facilities have power backups that would help contribute to its reliability. But, as we saw from today’s event, something still failed.

I should note that we have both publicly and privately reachable resources in AWS. The publicly reachable resources have failovers built in for situations like these (they happen automatically), but with our architecture the privately reachable resources depend solely on AWS Direct Connect. For example, our Bitbucket failure today happened because we rely on AWS Direct Connect to link the Bitbucket Cloud components that we host in our data centers with the ones that we host on AWS. Bitbucket could continue connecting to services in our own data centers and on the public Internet/AWS, but could not talk to the privately reachable resources in the Atlassian infrastructure hosted on AWS.

We understand how important this is and how it affects our customers, and we dedicated several teams to this issue as soon as it was reported. AWS has resolved the issue, but we will look into ways to help prevent and better mitigate these types of issues in the future as part of our incident review and improvement processes.
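
For readers wondering what the failover asked about earlier in the thread could look like in principle, here is a minimal Python sketch of priority-based path selection: prefer the private link, and fall back to a VPN path when the private link's health check fails. It is not Atlassian's or AWS's implementation. The `Path` type, the `select_path` function, the path names, and the health-check callables are all hypothetical, and a real setup would more likely do this at the routing layer (for example with BGP) than in application code.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Path:
    """A hypothetical network path to privately reachable resources."""
    name: str
    priority: int                    # lower value = more preferred
    is_healthy: Callable[[], bool]   # assumed health probe for this path


def select_path(paths: Sequence[Path]) -> Optional[Path]:
    """Return the most preferred healthy path, or None if every path is down."""
    for path in sorted(paths, key=lambda p: p.priority):
        if path.is_healthy():
            return path
    return None


if __name__ == "__main__":
    # Simulate the outage: the private link is down, the VPN is still up.
    paths = [
        Path("direct-connect", priority=0, is_healthy=lambda: False),
        Path("vpn-over-internet", priority=1, is_healthy=lambda: True),
    ]
    chosen = select_path(paths)
    print(chosen.name if chosen else "no path available")
```

With only the private link configured, `select_path` returns `None` during the outage, which mirrors the situation described above for the privately reachable resources.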