by rsanders
2103 days ago
Can you expand on this: "AWS healthchecks each kubernetes node, but not your pods themselves"? Are you talking about a keep-alive connection to an unhealthy pod being reused for multiple requests? If I understand you correctly, the failure modes are: a) the ALB keeps sending requests over an established HTTP keep-alive connection that terminates in an unhealthy pod, which it treats as healthy because the node is healthy and can route new traffic to another, healthy pod; and b) the health of an established keep-alive connection is taken to be that of the node rather than the destination pod, so a node that becomes unhealthy can cause the ALB to terminate a keep-alive connection unnecessarily.

We had to switch to target-type=instance because of issues with pods not being deregistered. I'd prefer target-type=ip, but preventing 500s on rollouts seemed to require a fair bit of testing and tuning with a very specific approach, e.g. introducing a longish delay on pod termination with a lifecycle hook and using the pod readiness gate support recently added to alb-ingress-controller.
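To make the "longish delay on pod termination" part concrete, here is a minimal sketch of what I mean, not a tested recipe. It builds a Deployment patch that adds a preStop sleep so the pod keeps serving while the ALB deregisters it. The function name, the container name "web", and the 30s/60s values are all made up for illustration, and it assumes the container image ships a `sleep` binary. The readiness gate part is not covered here.

```python
import json


def prestop_delay_patch(container_name, sleep_seconds=30, grace_seconds=60):
    """Build a strategic-merge patch that adds a preStop sleep to one container.

    Hypothetical helper: the name and default values are illustrative only.
    """
    # Kubernetes kills the pod once the grace period expires, so the grace
    # period has to be longer than the preStop sleep for the delay to help.
    if grace_seconds <= sleep_seconds:
        raise ValueError("grace_seconds must be greater than sleep_seconds")
    return {
        "spec": {
            "template": {
                "spec": {
                    "terminationGracePeriodSeconds": grace_seconds,
                    "containers": [
                        {
                            # Containers are merged by name in a strategic-merge patch.
                            "name": container_name,
                            "lifecycle": {
                                "preStop": {
                                    # Assumes `sleep` exists in the image.
                                    "exec": {"command": ["sleep", str(sleep_seconds)]}
                                }
                            },
                        }
                    ],
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the patch as JSON; "web" is a placeholder container name.
    print(json.dumps(prestop_delay_patch("web"), indent=2))
```

The output could be saved to a file and applied with something like `kubectl patch deployment <name> --patch-file <file>`. The right sleep length depends on how long the target group takes to deregister the pod, which is exactly the tuning I was hoping to avoid.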
Here's the annotation that I used to fix that: