| HN Mirror



	by ris 2033 days ago
	The one thing I want to know in cases like this is: why did it affect multiple Availability Zones? Making a resource multi-AZ is a significant additional cost (and often involves additional complexity) and we really need to be confident that typical observed outages would actually have been mitigated in return.


talawahtech 2033 days ago

Multi-AZ doesn't protect against a software/OS issue like this. Multi-AZ would be relevant if it had been an infrastructure failure (e.g. underlying EC2 instances or networking).

The relevant resiliency pattern in this case would be what they refer to as cell-based architecture, where within an AZ services are broken down into smaller independent cells to minimize the blast radius.

They specifically mention in the write-up that this was a gap they plan to address: the "backend" portion of Kinesis was already cellularized, but that step had not yet been completed on the "frontend".

Cellularization in combination with workload partitioning would have helped, e.g. don't run CloudWatch, Cognito and customer workloads on the same set of cells.

It is also important to note that cellularization only helps in this case if they limit each code deployment to a small number of cells at a time.

This YouTube video[1] of a re:Invent presentation does a great job of explaining it. The cell-based material starts around minute 20.

1. https://youtu.be/swQbA4zub20
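
A minimal sketch of the two ideas above, assuming a simple hash-based assignment of accounts to cells and a fixed cap on how many cells one deployment wave may touch. The function names, the hashing scheme and the cell names are hypothetical illustrations, not how Kinesis is actually built.

```python
from __future__ import annotations

import hashlib


def assign_cell(account_id: str, num_cells: int) -> int:
    """Map an account to a cell index with a stable hash (hypothetical scheme)."""
    if num_cells < 1:
        raise ValueError("num_cells must be at least 1")
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def deployment_waves(cells: list[str], max_cells_per_wave: int) -> list[list[str]]:
    """Split cells into waves so a bad change reaches only a few cells at a time."""
    if max_cells_per_wave < 1:
        raise ValueError("max_cells_per_wave must be at least 1")
    return [
        cells[i:i + max_cells_per_wave]
        for i in range(0, len(cells), max_cells_per_wave)
    ]


def blast_radius(cells_hit: int, total_cells: int) -> float:
    """Fraction of cells (and roughly of customers) exposed to a bad change."""
    return cells_hit / total_cells


if __name__ == "__main__":
    cells = [f"cell-{i}" for i in range(8)]  # hypothetical cell names
    print(assign_cell("account-1234", len(cells)))
    print(deployment_waves(cells, 2))
    print(blast_radius(2, len(cells)))
```

Modulo hashing is only the simplest choice here: it reshuffles accounts whenever the number of cells changes, so a real system would more likely keep an explicit account-to-cell mapping.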

talawahtech 2033 days ago

Another relevant point made in the video is that they restrict cells to a maximum size which then makes it easier to test behavior at that size. This would have also helped avoid this specific issue since the number of threads would have been tied to the number of instances in a cell.
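
As a rough illustration of why a maximum cell size bounds the thread count, here is a sketch that assumes each server keeps one thread per peer in its own cell. The one-thread-per-peer model, the function names and the limit values are assumptions for illustration only.

```python
def threads_per_server(cell_size: int) -> int:
    """Peer threads on one server, assuming one thread per other server in the cell."""
    return max(cell_size - 1, 0)


def max_safe_cell_size(thread_limit: int, other_threads: int = 0) -> int:
    """Largest cell size whose peer threads plus other threads stay within the limit."""
    return max(thread_limit - other_threads + 1, 0)


if __name__ == "__main__":
    limit = 1000    # hypothetical per-process thread limit
    baseline = 200  # hypothetical threads used for everything else
    size = max_safe_cell_size(limit, baseline)
    print(size, threads_per_server(size) + baseline)
```

With a capped cell size the worst case can be tested once at that size, instead of being discovered the first time the fleet grows past it.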

I definitely recommend checking out the video. Even if you have seen it before, rewatching it in the context of this post-mortem really makes it hit home.

ignoramous 2033 days ago

> Another relevant point made in the video is that they restrict cells to a maximum size which then makes it easier to test behavior at that size.

Googlers would be quick to point out that Borg does this natively across all their services: https://news.ycombinator.com/item?id=19393926

joshuamorton 2033 days ago

As another googler, I'd argue that Borg's concept of cells isn't like what Amazon is calling "cells" here. Borg cells are, as far as I can tell, akin to an AWS Zone. There are similar concepts within Google that match the concept of "an application unit that is in multiple compute units but is isolated from other similar application units, and can be used for a singular customer or workload". There are multiple terms for this concept, which I'd be happy to share within Google.

brown9-2 2033 days ago

But why does a Kinesis outage due to a capacity increase affect multiple AZs, if one assumes the capacity increase (and the frontend servers impacted by it) are in a single zone?

EdwardDiego 2033 days ago

Indeed. We're paying (and designing our systems to work on multiple AZs) to reduce the risk of outages, but then their back-end services are reliant on services in a sole region?

otterley 2033 days ago

(Disclaimer: I work for AWS but opinions are my own. I also do not work with the Kinesis team.)

Nearly all AWS services are regional in scope, and for many (if not most) services, they are scaled at a cellular level within a region. Accounts are assigned to specific cells within that region.

There are very, very few services that are global in scope, and it is strongly discouraged to create cross-regional dependencies -- not just as applied to our customers, but to ourselves as well. IAM and Route 53 are notable exceptions, but they offer read replicas in every region and are eventually consistent: if the primary region has a failure, you might not be able to make changes to your configuration, but the other regions will operate on read-only replicas.

This incident was regional in scope: us-east-1 was the only impacted region. As far as I know, no other region was impacted by this event. So customers operating in other regions were largely unaffected. (If you know otherwise, please correct me.)

As a Solutions Architect, I regularly warn customers that running in multiple Availability Zones is not enough. Availability Zones protect you from many kinds of physical infrastructure failures, but not necessarily from regional service failures. So it is super important to run in multiple regions as well: not necessarily active-active, but at least in a standby mode (i.e. "pilot light") so that customers can shed traffic from the failing region and continue to run their workloads.

Corrado 2032 days ago

This outage highlighted our dependency on Cognito. Everything else we are doing can (and probably should) be replicated to another region, which would resolve these types of issues.

However, Cognito is very region specific and there is currently no way to run in active-active or even in standby mode. The problem is user accounts; you can't sync them to another region and you can't back-up/restore them (with passwords). Until AWS comes up with some way to run Cognito in a cross-region fashion, we are pretty much stuck in a single region and vulnerable to this type of outage in the future.

otterley 2032 days ago

Please bring this to the attention of your account team! They will bring your feedback to the service team. While I can’t speak for the Cognito team, I can assure you they care deeply about customer satisfaction.

Corrado 2031 days ago

That's a great idea. I'm writing an email right now! :)

roman_sf 2033 days ago

What do you mean by cross-regional dependencies? Isn't running in a multi-region setup by itself adding a dependency?

Speaking of multi-region services, what do you think about Google now offering all three major building blocks as multi-regional?

They have multi-regional buckets, a load balancer with a single anycast IP, and a document DB (Firebase). Pub/Sub can route automatically to the nearest region. Nothing like this is available in Amazon, well, only DIY building blocks.

otterley 2033 days ago

If your workload can run in region B even if there is a serious failure of a service in region A, in which your workload normally runs, then no, you have not created a cross-regional dependency.

When I talk about cross regional dependency, I talk about an architectural decision that can lead to a cascading failure in region B, which is healthy by all accounts, when there is a failure in region A.

AWS has services that allow for regional replication and failover. DynamoDB, RDS, and S3 all offer cross region replication. And Global Accelerator provides an anycast IP that can front regional services and fail over in the event of an incident.

roman_sf 2032 days ago

I haven't used Global Accelerator, but it doesn't look like the same thing. The landing page says: "Your traffic routing is managed manually, or in console with endpoint traffic dials and weights".

otterley 2032 days ago

“Global Accelerator continuously monitors the health of all endpoints. When it determines that an active endpoint is unhealthy, Global Accelerator instantly begins directing traffic to another available endpoint. This allows you to create a high-availability architecture for your applications on AWS.”

https://docs.aws.amazon.com/global-accelerator/latest/dg/dis...

Alternatively, global load balancing with Route 53 remains a viable, mature option as well. Health checks and failover are fully supported.
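
The health-check-and-failover behaviour described above boils down to "send traffic to the first endpoint that passes its health check". A minimal sketch of that idea follows. It does not use the Global Accelerator or Route 53 APIs; the endpoint names and the `is_healthy` callback are hypothetical stand-ins.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all fail."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    # Hypothetical regional endpoints in priority order: primary first, standby second.
    endpoints = ["app.us-east-1.example.com", "app.us-west-2.example.com"]
    unhealthy = {"app.us-east-1.example.com"}  # simulate a regional failure
    print(pick_endpoint(endpoints, lambda e: e not in unhealthy))
```

The standby region only helps if it can actually serve the workload, which is the "pilot light" point made earlier in the thread.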

qz2 2033 days ago

Correct.

I, like many people, discovered this when something broke in one of the golden regions. In my case it was CloudFront and ACM.

Realistically you can’t trust one provider at all if you have high availability requirements.

The justification is apparently that the cloud is taking all this responsibility away from people but from personal experience running two cages of kit at two datacenters the TCO was lower and the reliability and availability higher. Possibly the largest cost is navigating Harry-Potter-esque pricing and automation laws. The only gain is scaling past those two cages.

Edit: I should point out however that an advantage of the cloud is actually being able to click a couple of buttons and get rid of two cages worth of DC equipment instantly if your product or idea doesn't work out!

freehunter 2033 days ago

>you can’t trust one provider at all

The hard part with multi-cloud is, you're just increasing your risk of being impacted by someone's failure. Sure, if you're all-in on AWS and AWS goes down, you're all-out. But if you're on [AWS, GCP] and GCP goes down, you're down anyway. Even though AWS is up, you're down because Google went down. And if you're on [AWS, GCP, Azure] and Azure goes down, it doesn't matter that AWS and GCP are up... you're down because Azure is down. The only way around that is architecting your business to run with only one of those vendors, which means you're paying 3x more than you need to 99.99999% of the time.

The probability that at least one of [AWS, Azure, GCP] is down is way higher than the probability that any particular one of them is down. And the probability that the two cages in your datacenter are down is way higher than the probability that any one of the hyperscalers is down.
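
To put rough numbers on the argument, here is a sketch that assumes each provider is down independently with the same probability. The independence assumption and the 0.1% figure are illustrative, not measured values.

```python
def p_any_down(p_down: float, n: int) -> float:
    """Probability that at least one of n independent providers is down."""
    return 1 - (1 - p_down) ** n


def p_all_down(p_down: float, n: int) -> float:
    """Probability that all n independent providers are down at the same time."""
    return p_down ** n


if __name__ == "__main__":
    p = 0.001  # hypothetical per-provider probability of being down
    print(p_any_down(p, 3))  # what you face if your app needs all three providers up
    print(p_all_down(p, 3))  # what you face if any single provider can carry the load
```

Which of the two numbers applies depends on the architecture: an application that needs every provider up gets the first, one that can run on any single provider gets the second.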

cowsandmilk 2032 days ago

> which means you're paying 3x more than you need to 99.99999% of the time.

This would be a poor decision. If you assume AWS, GCP, and Azure fail independently, you can pay 1.5x. Each of the 3 providers would be scaled to take 50% of your traffic. If any one fails, you would still be able to handle 100%. This is a common way to structure applications. Assuming independence means that more replicas result in less overprovisioning. With 2 replicas you need to provision 2x, since each must be able to carry 100%. With 5 independent replicas you need to provision only 1.25x to be resilient against one failure, as each replica is scaled to 25%.

In general, N replicas need N/(N-1) overprovisioning to be resilient against one replica failing.
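
The N/(N-1) rule can be checked with a few lines of Python. This is a sketch under the same assumption as the comment (independent replicas, tolerate exactly one failure); the function names are made up for illustration.

```python
def per_replica_share(n: int) -> float:
    """Fraction of total traffic each replica must handle so that n-1 can carry 100%."""
    if n < 2:
        raise ValueError("need at least 2 replicas to survive one failure")
    return 1 / (n - 1)


def overprovision_factor(n: int) -> float:
    """Total provisioned capacity relative to actual load: N / (N - 1)."""
    return n * per_replica_share(n)


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(n, per_replica_share(n), overprovision_factor(n))
```

For 2, 3 and 5 replicas this gives factors of 2x, 1.5x and 1.25x, with each replica sized for 100%, 50% and 25% of the load respectively.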

qz2 2033 days ago

I disagree. It’s about mitigating the risk of a single provider’s failure. Single providers go down all the time. We’ve seen it from all three major cloud vendors.

freehunter 2032 days ago

You disagree with what? That relying on three vendors increases your risk of being impacted by one? That's just statistics. You can disagree with it, but that doesn't make it incorrect.

Or do you disagree that planning for a total failure of one and running redundant workloads on other vendors increases your costs 99.99999% of the time? Because that's a fairly standard SLA from each of the major vendors. Let's even reduce it to EC2's SLA, 99.99%. So 99.99% of the time you're paying 3x as much as you need to be paying just to maintain your services for roughly an extra hour per year (0.01% of a year is about 53 minutes). Again, you can disagree with that but that doesn't make it incorrect.

Some businesses might need that extra hour, and the cost of the extra services might be cheaper than the cost of an hour of downtime per year. But you're not going to find many businesses like that. Either you're running completely redundant workloads, paying 3x as much for roughly an extra hour per year, or you're going to be taken offline when any one of the three goes down independently of the others.

Single providers go down, yes. And three providers go down three times as often as one. Either you're massively overspending or you're tripling your risk of downtime. If multi-cloud worked, you'd be hearing people talking about it and their success stories would fill the front page of Hacker News. They don't, because it doesn't.
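
The downtime budget implied by an availability percentage is easy to compute directly. A small sketch, assuming a 365-day year; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year a service may be down while still meeting the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for sla in (99.9, 99.95, 99.99):
        hours = downtime_hours_per_year(sla)
        print(f"{sla}% -> {hours:.2f} hours ({hours * 60:.0f} minutes) per year")
```

By this arithmetic 99.99% allows about 53 minutes of downtime per year, while a budget of roughly four hours corresponds to about 99.95%.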

hedora 2033 days ago

How do you test failover from provider-wide outages?

I’ve never heard of an untested failover mechanism that worked. Most places are afraid to invoke such a thing, even during a major outage.

qz2 2033 days ago

That’s fairly simple. Regular scenario planning, drills and suitable chaos engineering tooling.

Being afraid of failures is a massive sign of problems. I’ve worked in those sorts of places before.

coredog64 2033 days ago

ACM and CloudFront being sticky to us-east-1 is particularly annoying. I’m happy not being multi regional (I don’t have that level of DR requirements), but these types of services require me to incorporate all the multi region complexity.

lttlrck 2033 days ago

Harry-Potter-esque pricing?

Is that a reference to the difficulty of calculating the cost of visiting all the rides at Universal? That's my best guess...

qz2 2033 days ago

It's more a stab at the inconsistency of rules around magic.

"Well this pricing rule only works on a Tuesday lunch time if you're standing on one leg with a sausage under each arm and a traffic cone on your head"

And there are a million of those to navigate.

frankietaylr 2033 days ago

AWS should make this more transparent so that we make better design choices.