Hacker News new | ask | show | jobs
by bilalq 1429 days ago
It's not so cut-and-dried. The AZ isolation guarantees are not quite at the maturity they need to be.

If you're using any managed services by AWS, you need to rely on their own services to be AZ fault-tolerant. In AWS speak, they may well be (just with elevated error rates for a few minutes while load balancing shifts traffic away from a bad AZ). But as an AWS customer, you still feel the impact. As an example, one of our CodePipelines failed the deployment step with an InternalError from CloudFormation. However, the actual underlying stack deployment succeeded. When we went to retry that stage, it wouldn't succeed because the changeset to apply is no more. It required pushing a dummy change to unblock that pipeline.

Similarly, many customers run Lambdas outside of VPCs that theoretically shouldn't be tied to an AZ. You're still reliant on the AWS Lambda team to shift traffic away from a failing AZ, and until they do that, you'll see "elevated error rates" as well.