|
|
|
|
|
by ris
2033 days ago
|
|
The one thing I want to know in cases like this is: why did it affect multiple Availability Zones? Making a resource multi-AZ is a significant additional cost (and often involves additional complexity) and we really need to be confident that typical observed outages would actually have been mitigated in return. |
|
The relevant resiliency pattern in this case would be what they refer to as cell-based architecture, where within an AZ services are broken down into smaller independent cells to minimize the blast radius.
They specifically mention in the write-up that this was a gap they plan to address, the "backend" portion of Kinesis was already cellularized but that step had not yet been completed on the "frontend".
Celluarization in combination with workload partitioning would have helped, e.g. don't run Cloudwatch, Cognito and Customer workloads on the same set of cells.
It is also important to note that celluarization only helps in this case if they limit code deployment to a limited number of cells at a time.
This YouTube video[1] of a re:invent presentation does a great job of explaining it. The cell-based stuff, starts around minute 20.
1. https://youtu.be/swQbA4zub20