| HN Mirror


by nodesocket 1429 days ago

It's always more complicated than just deploying EC2 instances into multiple AZs. Here are some things I noticed during today's events.

First: RDS. One of our RDS instances failed over to the secondary zone because its primary was in the zone that had the power outage. RDS failovers are not free: they come with a small window of downtime (60-120s, as claimed by AWS[1]).
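To ride out a failover window like that, the client side usually has to retry instead of failing on the first dropped connection. A minimal sketch in Python, assuming the database call surfaces failures as the built-in ConnectionError (real drivers raise their own exception types) and that a 180s budget is enough to cover the 60-120s window; call_with_retry and all of its parameters are hypothetical names, not part of any AWS SDK:

```python
import time


def call_with_retry(op, max_wait=180.0, base_delay=1.0, max_delay=15.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call op() until it succeeds or the retry budget (max_wait seconds) runs out.

    Assumption: op raises ConnectionError while the database is failing over.
    Swap in your driver's own exception type in real code.
    """
    deadline = clock() + max_wait
    delay = base_delay
    while True:
        try:
            return op()
        except ConnectionError:
            # Give up if the next sleep would take us past the deadline.
            if clock() + delay > deadline:
                raise
            sleep(delay)
            # Exponential backoff, capped so we keep probing during the window.
            delay = min(delay * 2, max_delay)
```

The delay is capped on purpose: with an uncapped exponential backoff a client could sit idle for a long time after the new primary is already accepting connections.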

Second: EKS (Kubernetes). One of our Kubernetes EC2 worker nodes (in EKS) went down because it was in the zone with the power outage. Kubernetes did a decent job of rescheduling pods, but there were definitely edge cases, mainly with Consul and Traefik running inside the cluster. Finally, when the worker node came back up, almost nothing got scheduled back onto it, and I had to manually redeploy to even out pod distribution again. That last issue might be something I can improve by using the newer Kubernetes pod spec field topologySpreadConstraints[2].
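For reference, here is roughly what a zone spread constraint looks like on a Deployment. This is a sketch that builds the manifest as a Python dict and prints it as JSON (which kubectl accepts as well as YAML). The app name "web", the image, and the helper spread_deployment are made up for illustration; the field names (maxSkew, topologyKey, whenUnsatisfiable, labelSelector) and the topology.kubernetes.io/zone node label are the standard Kubernetes ones:

```python
import json


def spread_deployment(name, image, replicas=3, max_skew=1):
    """Build a Deployment manifest whose pods are spread across zones.

    Hypothetical helper for illustration; name and image are placeholders.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "topologySpreadConstraints": [{
                        # Max allowed difference in matching pod count between zones.
                        "maxSkew": max_skew,
                        "topologyKey": "topology.kubernetes.io/zone",
                        # Soft constraint: prefer an even spread, but still schedule
                        # when a zone is down. "DoNotSchedule" makes it a hard rule.
                        "whenUnsatisfiable": "ScheduleAnyway",
                        "labelSelector": {"matchLabels": labels},
                    }],
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(spread_deployment("web", "nginx:1.25"), indent=2))
```

One caveat, as far as I understand it: the constraint is only evaluated when a pod is being scheduled, so it won't by itself move already-running pods back onto a node that has recovered. A rollout restart (or something like the descheduler) would probably still be needed for that.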

[1] https://aws.amazon.com/premiumsupport/knowledge-center/rds-f...

[2] https://kubernetes.io/docs/concepts/scheduling-eviction/topo...