| Multi-AZ doesn't protect against a software/OS issue like this, Multi-AZ would be relevant if it was an infrastructure failure (e.g. underlying EC2 instances or networking). The relevant resiliency pattern in this case would be what they refer to as cell-based architecture, where within an AZ services are broken down into smaller independent cells to minimize the blast radius. They specifically mention in the write-up that this was a gap they plan to address, the "backend" portion of Kinesis was already cellularized but that step had not yet been completed on the "frontend". Celluarization in combination with workload partitioning would have helped, e.g. don't run Cloudwatch, Cognito and Customer workloads on the same set of cells. It is also important to note that celluarization only helps in this case if they limit code deployment to a limited number of cells at a time. This YouTube video[1] of a re:invent presentation does a great job of explaining it. The cell-based stuff, starts around minute 20. 1. https://youtu.be/swQbA4zub20 |
I definitely recommend checking out the video. Even if you have seen it before, rewatching it in the context of this post-mortem really makes it hit home.