by talawahtech 2034 days ago
Multi-AZ doesn't protect against a software/OS issue like this. Multi-AZ would be relevant if it were an infrastructure failure (e.g. underlying EC2 instances or networking).

The relevant resiliency pattern in this case would be what they refer to as cell-based architecture, where, within an AZ, services are broken down into smaller independent cells to minimize the blast radius.

They specifically mention in the write-up that this was a gap they plan to address: the "backend" portion of Kinesis was already cellularized, but that step had not yet been completed on the "frontend".

Cellularization in combination with workload partitioning would have helped, e.g. don't run CloudWatch, Cognito and customer workloads on the same set of cells.
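To make the workload partitioning idea concrete, here is a minimal sketch. The cell names, the `CELLS_BY_WORKLOAD` table and the `assign_cell` helper are all made up for illustration and are not how AWS actually does it. Each workload class gets its own disjoint set of cells, and a tenant is hashed onto a cell within that set only.

```python
import hashlib

# Hypothetical layout: each workload class owns a disjoint set of cells,
# so a bad cell in one class cannot take down another class.
CELLS_BY_WORKLOAD = {
    "cloudwatch": ["cw-cell-0", "cw-cell-1"],
    "cognito": ["cog-cell-0", "cog-cell-1"],
    "customer": ["cust-cell-0", "cust-cell-1", "cust-cell-2", "cust-cell-3"],
}


def assign_cell(workload: str, tenant_id: str) -> str:
    """Pick a cell for a tenant, staying inside that workload's own cell set."""
    cells = CELLS_BY_WORKLOAD[workload]
    # Stable hash so the same tenant always lands on the same cell.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]
```

With this kind of layout, a failure confined to the "customer" cells leaves the "cloudwatch" and "cognito" cells untouched, which is the blast-radius argument above.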

It is also important to note that cellularization only helps in this case if they limit code deployment to a limited number of cells at a time.
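A rough sketch of what "a limited number of cells at a time" could look like. The `rollout_in_waves` function and its `deploy` / `healthy` callbacks are hypothetical names of mine, not anything from the write-up. The rollout touches one wave of cells, checks health, and stops before the next wave if anything looks wrong.

```python
from typing import Callable, List


def rollout_in_waves(
    cells: List[str],
    wave_size: int,
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to at most `wave_size` cells at a time.

    Stops after the first wave that contains an unhealthy cell and
    returns the list of cells that were touched (including that wave).
    """
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    touched: List[str] = []
    for start in range(0, len(cells), wave_size):
        wave = cells[start:start + wave_size]
        for cell in wave:
            deploy(cell)
            touched.append(cell)
        # Health gate: a bad wave halts the rollout before it spreads.
        if not all(healthy(cell) for cell in wave):
            break
    return touched
```

If the second cell goes unhealthy with `wave_size=1`, only two cells ever receive the change; the remaining cells keep running the old version.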

This YouTube video[1] of a re:Invent presentation does a great job of explaining it. The cell-based stuff starts around minute 20.

1. https://youtu.be/swQbA4zub20

2 comments

Another relevant point made in the video is that they restrict cells to a maximum size which then makes it easier to test behavior at that size. This would have also helped avoid this specific issue since the number of threads would have been tied to the number of instances in a cell.
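Back-of-the-envelope version of that argument, assuming each frontend server keeps one OS thread per peer it talks to (my reading of the post-mortem). The fleet and cell sizes used below are invented numbers, and both function names are hypothetical.

```python
from typing import Optional


def peer_threads_per_server(group_size: int) -> int:
    """One thread per *other* server in the group (assumed model)."""
    return max(group_size - 1, 0)


def worst_case_threads(fleet_size: int, max_cell_size: Optional[int] = None) -> int:
    """Threads per server for a flat fleet, or for a fleet split into capped cells."""
    if max_cell_size is None:
        group = fleet_size  # flat fleet: every server talks to every other server
    else:
        group = min(fleet_size, max_cell_size)  # peers are limited to one cell
    return peer_threads_per_server(group)
```

Under these assumptions, a flat fleet of 5000 servers needs 4999 threads per server, and adding capacity keeps pushing that number up. With cells capped at 100 servers it stays at 99 however large the fleet gets, so the limit can be tested once at the maximum cell size.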

I definitely recommend checking out the video. Even if you have seen it before, rewatching it in the context of this post-mortem really makes it hit home.

> Another relevant point made in the video is that they restrict cells to a maximum size which then makes it easier to test behavior at that size.

Googlers would be quick to point out that Borg does this natively across all their services: https://news.ycombinator.com/item?id=19393926

As another Googler, I'd argue that Borg's concept of cells isn't like what Amazon is calling "cells" here. Borg cells are, as far as I can tell, akin to an AWS Zone. There are similar concepts within Google that match the concept of "an application unit that is in multiple compute units but is isolated from other similar application units, and can be used for a singular customer or workload". There are multiple terms for this concept, which I'd be happy to share within Google.

But why does a Kinesis outage due to a capacity increase affect multiple AZs, if one assumes the capacity increase (and the frontend servers impacted by it) are in a single zone?