Hacker News
by outworlder 234 days ago
I'm wondering why your company and others haven't just evicted themselves from us-east-1. It's the worst region for outages and it's not even close.

Our company decided years ago to use any region other than us-east-1.

Of course, that doesn't help with services that are 'global', which usually means us-east-1.

Several reasons, really:

1. The main one: it's the cheapest region, so when people select where to run their services they pick it because "why pay more?"

2. It's the default. Many tutorials and articles online show it in the examples, many deployment and other devops tools use it as a default value.

3. Related to no. 2: AI models generate cloud configs and code examples with it unless asked otherwise.

4. Its location makes it Europe-friendly, too. If you have a small service and you'd like to capture European and North American audiences from a single location, us-east-1 is a very good choice.

5. Many Amazon features are available in that region first and then spread out to other locations.

6. It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks. In hybrid cloud scenarios where you want to connect bits of your infrastructure running on AWS and on some physical hardware by a set of dedicated fiber optic lines us-east-1 is the place to do it.

7. Yes, for AWS deployments it's an experimental location that has higher risks of downtime compared to other regions, but in practice when a sizable part of us-east-1 is down other AWS services across the world tend to go down, too (along with half of the internet). So, is it really that risky to run over there, relatively speaking?

It's the world's default hosting location, and today's outages show it.

> it's the cheapest region

In every SKU I've ever looked at / priced out, all of the AWS NA regions have ~equal pricing. What's cheaper specifically in us-east-1?

> Europe-friendly

Why not us-east-2?

> Many Amazon features are available in that region first and then spread out to other locations.

Well, yeah, that's why it breaks. Using not-us-east-1 is like using an LTS OS release: you don't get the newest hotness, but it's much more stable as a "build it and leave it alone" target.

> It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks.

This is a better argument, but in practice, it's very niche — 2-5ms of speed-of-light delay doesn't matter to anyone but HFT folks; anyone else can be in a DC one state away with a pre-arranged tier1-bypassing direct interconnect, and do fine. (This is why OVH is listed on https://www.cloudinfrastructuremap.com/ despite being a smaller provider: their DCs have such interconnects.)

For that matter, if you want "low-latency to North America and Europe, and high-throughput lowish-latency peering to many other providers" — why not Montreal [ca-central-1]? Quebec might sound "too far north", but from the fiber-path perspective of anywhere else in NA or Europe, it's essentially interchangeable with Virginia.

Lots of stuff is priced differently.

Just go to the EC2 pricing page and change from us-east-1 to us-west-1

https://aws.amazon.com/ec2/pricing/on-demand/

us-west-1 is the one outlier. us-east-1, us-east-2, and us-west-2 are all priced the same.
There are many other AWS regions than the ones you listed, and many different prices.
This seems like a flaw Amazon needs to fix.

Incentivize the best behaviors.

Or is there a perspective I don't see?

How is it a flaw!? Building datacenters in different regions comes with very different costs, and different costs to run. Power doesn't cost exactly the same in different regions. Local construction services are not priced exactly the same everywhere. Insurance, staff salaries, etc, etc... it all adds up, and it's not the same costs everywhere. It only makes sense that it would cost different amounts for the services run in different regions. Not sure how you're missing these easy-to-realize facts of life.
I think the cost of a day like Monday due to over-relying on a single location outweighs that.
> 5. Many Amazon features are available in that region first and then spread out to other locations.

This is the biggest one, isn't it? I thought Route 53 isn't even available in any other region.

Some AWS services are only available in us-east-1. Also a lot of people have not built their infra to be portable and the occasional outage isn't worth the cost and effort of moving out.
> the occasional outage isn't worth the cost and effort of moving out.

And looked at from the perspective of an individual company, as a customer of AWS, the occasional outage is usually an acceptable part of doing business.

However, today we’ve seen a failure that has wiped out a huge number of companies used by hundreds of millions - maybe billions - of people, and obviously a huge number of companies globally all at the same time. AWS has something like 30% of the infra market so you can imagine, and most people reading this will to some extent have experienced, the scale of disruption.

And the reality is that whilst bigger companies, like Zoom, are getting a lot of the attention here, we have no idea what other critical and/or life and death services might have been impacted. As an example that many of us would be familiar with, how many houses have been successfully burgled today because Ring has been down for around 8 out of the last 15 hours (at least as I measure it)?

I don’t think that’s OK, and I question the wisdom of companies choosing AWS as their default infra and hosting provider. It simply doesn’t seem to be very responsible to be in the same pond as so many others.

Were I a legislator I would now be casting a somewhat baleful eye at AWS as a potentially dangerous monopoly, and see what I might be able to do to force organisations to choose from amongst a much larger pool of potential infra providers and platforms, and I would be doing that because these kinds of incidents will only become more serious as time goes on.

You're suffering from survivorship bias. You know that old adage about the bullet holes in the planes, where someone pointed out that you should reinforce the parts without bullet holes, because these are the planes that came back.

It's the same thing here. Do you think other providers are better? If people moved to other providers, things would still go down, more likely than not it would be more downtime in aggregate, just spread out so you wouldn't notice as much.

At least this way, everyone knows why it's down, our industry has developed best practices for dealing with these kinds of outages, and AWS can apply their expertise to keeping all their customers running as long as possible.

> If people moved to other providers, things would still go down, more likely than not it would be more downtime in aggregate, just spread out so you wouldn't notice as much.

That is the point, though: Correlated outages are worse than uncorrelated outages. If one payment provider has an outage, choose another card or another store and you can still buy your goods. If all are down, no one can buy anything[1]. If a small region has a power blackout, all surrounding regions can provide emergency support. If the whole country has a blackout, all emergency responders are bound locally.

[1] Except with cash – might be worth keeping a stash handy for such purposes.

Yeah, exactly this. I don’t know why the person who responded to me is talking about survivorship bias… and I suppose I don’t really care because there’s a bigger point.

The internet was originally intended to be decentralised. That decentralisation begets resilience.

That’s exactly the opposite of what we saw with this outage. AWS has give or take 30% of the infra market, including many nationally or globally well known companies… which meant the outage caused huge global disruption of services that many, many people and organisations use on a day to day basis.

Choosing AWS, squinted at through a somewhat particular pair of operational and financial spectacles, can often make sense. Certainly it’s a default cloud option in many orgs, and always in contention to be considered by everyone else.

But my contention is that at a higher level than individual orgs - at a societal level - that does not make sense. And it’s just not OK for government and business to be disrupted on a global scale because one provider had a problem. Hence my comment on legislators.

It is super weird to me that, apparently, that’s an unorthodox and unreasonable viewpoint.

But you’ve described it very elegantly: 99.99% (or pick the number of 9s you want) uptime with uncorrelated outages is way better than that same uptime with correlated, and particularly heavily correlated, outages.
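
To put rough numbers on the correlated-versus-uncorrelated point, here is a minimal sketch. It assumes each provider is down a fraction (1 - availability) of the time and that providers fail independently of one another, which is an idealization; the function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365


def hours_all_down_shared(availability: float) -> float:
    """Everyone uses one provider: every service is down whenever it is."""
    return (1 - availability) * HOURS_PER_YEAR


def hours_all_down_spread(availability: float, providers: int) -> float:
    """Services are spread over `providers` independent providers.

    Everything is down at once only when all providers fail simultaneously,
    so the per-provider downtime fractions multiply.
    """
    return (1 - availability) ** providers * HOURS_PER_YEAR


if __name__ == "__main__":
    a = 0.9999  # "four nines", the figure used in the comment above
    print(f"one shared provider:  {hours_all_down_shared(a):.3f} h/year with everything down")
    for n in (2, 3):
        print(f"{n} independent providers: {hours_all_down_spread(a, n):.2e} h/year with everything down")
```

Each individual service sees the same uptime in both cases; what changes is how often everything is unavailable at the same moment.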

That’s a pretty bold claim. Where’s your data to back it up?

More importantly you appear to have misunderstood the scenario I’m trying to avoid, which is the precise situation we’ve seen in the past 24 hours where a very large proportion of internet services go down all at the same time precisely because they’re all using the same provider.

And then finally the usual outcome of increased competition is to improve the quality of products and services.

I am very aware of the WWII bomber story, because it’s very heavily cited in corporate circles nowadays, but I don’t see that it has anything to do with what I was talking about.

AWS is chosen because it’s an acceptable default that’s unlikely to be heavily challenged either by corporate leadership or by those on the production side because it’s good CV fodder. It’s the “nobody gets fired for buying IBM” of the early mid-21st century. That doesn’t make it the best choice though: just the easiest.

And viewed at a level above the individual organisation - or, perhaps from the view of users who were faced with failures across multiple or many products and services from diverse companies and organisations - as with today (yesterday!) we can see it’s not the best choice.

This is an assumption.

Reality is, though, that you shouldn't put all your eggs in the same basket. And it was indeed the case before the cloud. One service going down would have never had this cascade effect.

I am not even saying "build your own DC", but we barely have resiliency if we all rely on the same DC. That's just dumb.

From the standpoint of nearly every individual company, it's still better to go with a well-known high-9s service like AWS than smaller competitors though. The fact that it means your outages will happen at the same time as many others is almost like a bonus to that decision — your customers probably won't fault you for an outage if everyone else is down too.

That homogeneity is a systemic risk that we all bear, of course. It feels like systemic risks often arise that way, as an emergent result from many individual decisions each choosing a path that truly is in their own best interests.

Yeah, but this is exactly not what the internet is supposed to be. It’s supposed to be decentralised. It’s supposed to be resilient.

And at this point I’m looking at the problem and thinking, “how do we do that other than by legislating?”

Because left to their own devices a concerningly large number of people across many, many organisations simply follow the herd.

In the midst of a degrading global security situation I would have thought it would be obvious why that’s a bad idea.

Services like SES Inbound are only available in 2x US regions. AWS isn't great about making all services available in all regions :/
We're on Azure and they are worse in every aspect: bad deployment of services, and status pages that are more about PR than engineering.

At this point, is there any cloud provider that doesn't have these problems? (GCP is a non-starter because a false-positive YouTube TOS violation can get you locked out of GCP[1]).

[1]: https://9to5google.com/2021/02/26/stadia-port-of-terraria-ca...

Don't worry there was a global GCP outage a few months ago
Global auth is and has been a terrible idea.
If you can't figure out how to use a different Google account for YouTube from the GCP billing account, I don't know what to say. Google's in the wrong here, but spanner's good shit! (If you can afford it. and you actually need it. you probably don't.)
The problem isn't specifically getting locked out of GCP (though it is likely to happen for those out of the loop on what happened). It is that Google themselves can't figure out that a social media ban shouldn't affect your business continuity (and access to email or what-have-you).

It is an extremely fundamental level of incompetence at Google. One should "figure out" the viability of placing all of one's eggs in the basket of such an incompetent partner. They screwed the authentication issue up and, this is no slippery slope argument, that means they could be screwing other things up (such as being able to contact a human for support, which is what the Terraria developer also had issues with).

One of those still isn’t us-east-1 though and email isn’t latency-bound.
Except for OTP codes when doing 2FA in auth.
100ms isn’t going to make a difference to email-based OTP.

Also, who’s using email-based OTP?

Same calculation everyone makes but that doesn’t stop them from whining about AWS being less than perfect.
We have discussions coming up to evict ourselves from AWS entirely. Didn't seem like there was much of an appetite for it before this but now things might have changed. We're still small enough of a company to where the task isn't as daunting as it might otherwise be.
So did a previous company I worked at; all our stuff was in us-west-2. Then us-east-1 went down, and some global backend services that AWS depended on also went down and affected us-west-2.

I'm not sure a lot of companies are really looking at the costs of multi-region resiliency and hot failovers vs being down for 6 hours every year or so and writing that check.
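
The comparison described above can be written down as a back-of-the-envelope check. This is only a sketch: the 6 hours per year comes from the comment, while the hourly outage cost and the yearly failover cost in the example are made-up placeholders, and the function names are hypothetical.

```python
def expected_outage_cost(hours_down_per_year: float, cost_per_hour: float) -> float:
    """Expected yearly loss from staying single-region and riding out outages."""
    return hours_down_per_year * cost_per_hour


def failover_pays_off(
    hours_down_per_year: float,
    cost_per_hour: float,
    failover_cost_per_year: float,
    fraction_avoided: float = 1.0,
) -> bool:
    """True if the downtime cost avoided exceeds what multi-region costs to run.

    `fraction_avoided` is the share of downtime the failover actually removes;
    1.0 is optimistic, since global dependencies can still take you down.
    """
    avoided = expected_outage_cost(hours_down_per_year, cost_per_hour) * fraction_avoided
    return avoided > failover_cost_per_year


if __name__ == "__main__":
    # Placeholder figures: $10k lost per hour down, $250k/year for hot failover.
    print(expected_outage_cost(6, 10_000))
    print(failover_pays_off(6, 10_000, 250_000))
```

With numbers like these, writing the check for the outage is the cheaper option, which matches the behavior the comment describes.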

Yep. Many, many companies are fine saying “we’re going to be no more available than AWS is.”
Customers are generally a lot more understanding if half the internet goes down at the same time as you.
Yes, and that's a major reason so many just use us-east-1.
Is there some reason why "global" services aren't replicated across regions?

I would think a lot of clients would want that.

> Is there some reason why "global" services aren't replicated across regions?

On AWS's side, I think us-east-1 is legacy infrastructure because it was the first region, and things have to be made replicable.

For others on AWS who aren't AWS themselves: because AWS outbound data transfer is exorbitantly expensive. I'm building on AWS, and AWS's outbound data transfer costs are a primary design consideration for potential distribution/replication of services.

It is absolutely crazy how much AWS charges for data. Internet access in general has become much cheaper, and Hetzner gives unlimited traffic. I don't recall AWS ever decreasing prices for outbound data transfer.
I think there's two reasons: one, it makes them gobs of money. Two, it discourages customers from building architectures which integrate non-AWS services, because you have to pay the data transfer tax. This locks everyone in.

And yes, AWS' rates are highway robbery. If you assume $1500/mo for a 10 Gbps port from a transit provider, you're looking at $0.0005/GB with a saturated link. At a 25% utilization factor, still only $0.002/GB. AWS is almost 50 times that. And I guarantee AWS gets a far better rate for transit than list price, so their profit margin must be through the roof.
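
The per-GB figures above can be reproduced with a few lines. A sketch under stated assumptions: a 30-day month, decimal units (1 GB = 8 gigabits), the $1500/month port price from the comment, and an AWS egress list rate of about $0.09/GB that is assumed here, not taken from the comment.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def transit_cost_per_gb(port_cost_per_month: float, port_gbps: float, utilization: float = 1.0) -> float:
    """Effective $/GB for a flat-rate transit port at a given average utilization."""
    gb_per_second = port_gbps / 8  # gigabits/s -> gigabytes/s
    gb_per_month = gb_per_second * SECONDS_PER_MONTH * utilization
    return port_cost_per_month / gb_per_month


if __name__ == "__main__":
    aws_egress_per_gb = 0.09  # assumed list price, not from the comment
    saturated = transit_cost_per_gb(1500, 10)
    quarter = transit_cost_per_gb(1500, 10, utilization=0.25)
    print(f"saturated link:  ${saturated:.5f}/GB")
    print(f"25% utilization: ${quarter:.5f}/GB")
    print(f"AWS multiple at 25% utilization: {aws_egress_per_gb / quarter:.1f}x")
```

The results land close to the rounded $0.0005/GB and $0.002/GB quoted above.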

> I think there's two reasons: one, it makes them gobs of money. Two, it discourages customers from building architectures which integrate non-AWS services, because you have to pay the data transfer tax. This locks everyone in.

Which makes sense, but even their rates for traffic between AWS regions are still exorbitant. $0.10/GB for transfer to the rest of the Internet somewhat discourages integration of non-Amazon services (though you can still easily integrate with any service where most of your bandwidth is inbound to AWS), but their rates for bandwidth between regions are still in the $0.01-0.02/GB range, which discourages replication and cross-region services.

If their inter-region bandwidth pricing was substantially lower, it'd be much easier to build replicated, highly available services atop AWS. As it is, the current pricing encourages keeping everything within a region, which works for some kinds of services but not others.
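
For a sense of scale, here is a rough sketch of what cross-region replication costs at the rates quoted above. The 500 GB/day volume is a hypothetical workload, the function name is illustrative, and the calculation ignores request and storage charges.

```python
def monthly_replication_cost(
    gb_replicated_per_day: float,
    rate_per_gb: float,
    extra_regions: int = 1,
    days: int = 30,
) -> float:
    """Data transfer cost of shipping every replicated byte to each extra region."""
    return gb_replicated_per_day * days * rate_per_gb * extra_regions


if __name__ == "__main__":
    # Hypothetical workload: 500 GB of changes per day.
    for rate in (0.01, 0.02):  # the inter-region range quoted in the comment
        cost = monthly_replication_cost(500, rate, extra_regions=2)
        print(f"${rate:.2f}/GB to 2 extra regions: ${cost:,.0f}/month")
```

The bill scales linearly with both write volume and the number of extra regions, which is why write-heavy services tend to stay in one region.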

Even their transfer rates between AZs _in the same region_ are expensive, given they presumably own the fiber?

This aligns with their “you should be in multiple AZs” sales strategy, because self-hosted and third-party services can’t replicate data between AZs without expensive bandwidth costs, while their own managed services (ElastiCache, RDS, etc) can offer replication between zones for free.

Hetzner is "unlimited fair use" for 1Gbps dedicated servers, which means their average cost is low enough to not be worth metering, but if you saturate your 1Gbps for a month they will force you to move to metered. Also 10Gbps is always metered. Metered traffic is about $1.50 per TB outbound - 60 times cheaper than AWS - and completely free within one of their networks, including between different European DCs.

In general it seems like Europe has the most internet of anywhere - other places generally pay to connect to Europe, Europe doesn't pay to connect to them.

"Is there some reason why "global" services aren't replicated across regions?"

us-east-1 is there so the government can slurp up all the data. /tin-foil hat

Data residency laws may be a factor in some global/regional architectures.
So provide a way to check/uncheck which zones you want replication to. Most people aren't going to need more than a couple of alternatives, and they'll know which ones will work for them legally.
My guess is that for IAM it has to do with consistency and security. You don't want regions disagreeing on what operations are authorized. I'm sure the data store could be distributed, but there might be some bad latency tradeoffs.

The other concerns could have to do with the impact of failover to the backup regions.

Regions disagree on what operations are authorized. :-) IAM uses eventual consistency. As it should...

"Changes that I make are not always immediately visible": - "...As a service that is accessed through computers in data centers around the world, IAM uses a distributed computing model called eventual consistency. Any changes that you make in IAM (or other AWS services), including attribute-based access control (ABAC) tags, take time to become visible from all possible endpoints. Some delay results from the time it takes to send data from server to server, replication zone to replication zone, and Region to Region. IAM also uses caching to improve performance, but in some cases this can add time. The change might not be visible until the previously cached data times out...

...You must design your global applications to account for these potential delays. Ensure that they work as expected, even when a change made in one location is not instantly visible at another. Such changes include creating or updating users, groups, roles, or policies. We recommend that you do not include such IAM changes in the critical, high availability code paths of your application. Instead, make IAM changes in a separate initialization or setup routine that you run less frequently. Also, be sure to verify that the changes have been propagated before production workflows depend on them..."

https://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoo...
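
The advice in the quoted docs to verify propagation before depending on a change can be sketched as a generic poll with backoff. This is not an AWS API: `check` is a hypothetical callable you would supply (for example, one that attempts the call that needs the new role and returns True on success), and the timing defaults are arbitrary.

```python
import time
from typing import Callable


def wait_until_visible(
    check: Callable[[], bool],
    timeout_s: float = 60.0,
    initial_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `check` with exponential backoff until it returns True or time runs out.

    Returns True once the change is visible, False if the timeout elapsed first.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        if check():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        sleep(min(delay, remaining))  # never sleep past the deadline
        delay = min(delay * 2, max_delay_s)


if __name__ == "__main__":
    attempts = {"n": 0}

    def visible_on_third_try() -> bool:
        # Stand-in for a real propagation check.
        attempts["n"] += 1
        return attempts["n"] >= 3

    print(wait_until_visible(visible_on_third_try, initial_delay_s=0.01))
```

Running this in a setup routine, not in the request path, follows the recommendation in the quote to keep IAM changes out of critical code paths.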

Global replication is hard, and if they weren't designed with that in mind it's probably a whole lot of work.
I thought part of the point of using AWS was that such things were pretty much turnkey?
Mostly AWS relies on each region being its own isolated copy of each service. It gets tricky when you have globalized services like IAM. AWS tries to keep those to a minimum.
One advantage to being in the biggest region: when it goes down the headlines all blame AWS, not you. Sure you’re down too, but absolutely everybody knows why and few think it’s your fault.
For us, we had some minor impacts but most stuff was stable. Our bigger issue was 3rd party SaaS also hosted on us-east-1 (Snowflake and CircleCI) which broke CI and our data pipeline
This was a major issue, but it wasn't a total failure of the region.

Our stuff is all in us-east-1, ops was a total shitshow today (mostly because many 3rd party services besides aws were down/slow), but our prod service was largely "ok", a total of <5% of customers were significantly impacted because existing instances got to keep running.

I think we got a bit lucky, but no actual SLAs were violated. I tagged the postmortem as Low impact despite the stress this caused internally.

We definitely learnt something here about both our software and our 3rd party dependencies.

cheapest + has the most capacity