by codeduck 1646 days ago
How are health checks supposed to help when you can't do anything?

You said:

> it's all very well to say "expect to lose an AZ" but during this outage it's not been physically possible to remove the broken AZ instances from multi-AZ services because we cannot physically get them to respond to or acknowledge commands

"Expect to lose an AZ" includes not being able to make any changes to existing instances in the affected AZ.

If you had instances across multiple AZs behind an ELB with health checks, then the ELB should automatically remove the affected instances.

If you have a different architecture, you would want to:

* Have another mechanism that automatically stops sending traffic to impaired instances (ideal), or
* Have a means to manually remove the instances from service that does not depend on interacting with or modifying those instances in any way
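
To make those two options concrete, here is a minimal sketch of keeping the in-service decision on the routing side, so that an instance that cannot respond to or acknowledge anything still drops out of rotation. Everything in it is hypothetical and for illustration only: the `Pool` class, the injected `probe` callable, the threshold of 3, and the backend names are assumptions, not the API of ELB or any other load balancer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Pool:
    """Router-side view of backends; nothing here touches the instances."""
    probe: Callable[[str], bool]      # hypothetical: returns True if the backend answered
    unhealthy_threshold: int = 3      # assumed: consecutive failures before ejection
    backends: List[str] = field(default_factory=list)
    failures: Dict[str, int] = field(default_factory=dict)
    drained: Set[str] = field(default_factory=set)

    def check_all(self) -> None:
        """Run one round of probes; an exception (e.g. a timeout) counts as a failure."""
        for backend in self.backends:
            try:
                ok = self.probe(backend)
            except Exception:
                ok = False
            # A success resets the counter; a failure increments it.
            self.failures[backend] = 0 if ok else self.failures.get(backend, 0) + 1

    def drain(self, backend: str) -> None:
        """Manual removal: only router state changes, the instance is never contacted."""
        self.drained.add(backend)

    def in_service(self) -> List[str]:
        """Backends that are neither drained nor over the failure threshold."""
        return [
            b for b in self.backends
            if b not in self.drained
            and self.failures.get(b, 0) < self.unhealthy_threshold
        ]
```

With a probe that times out for one backend, that backend leaves `in_service()` after three rounds of `check_all()`, and `drain()` removes any other backend immediately. Neither path needs the impaired instance to cooperate, which is the property that matters when an AZ stops acknowledging commands.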

Does that help, or have I misunderstood your problem?

You have misunderstood our problem. EC2 behind ELB/ALB/NLB is the least of our issues.