by tetha 1438 days ago
Ew, we've had similar issues in the past. These are really messy and hard to recognize.

In our case, 1 out of 5 LB instances lost its connection to service discovery, so it later missed a failover of one of the 5 backends for a service and kept sending traffic to the old address. As a result, something like 1 in 20 to 1 in 25 requests got answered with a connection refused. That took a minute to find.
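For anyone wondering where the 1 in 25 comes from, here is a rough back-of-the-envelope sketch. It assumes requests are spread evenly across the LB instances and that each LB spreads them evenly across the backends it knows about; the function name and parameters are made up for illustration, not taken from any real tool.

```python
from fractions import Fraction


def failure_fraction(stale_lbs, total_lbs, dead_backends, total_backends):
    """Expected share of failing requests.

    Assumes uniform load balancing at both layers (an assumption,
    real traffic is rarely this even).
    """
    # Chance a request lands on an LB with a stale backend list...
    p_stale_lb = Fraction(stale_lbs, total_lbs)
    # ...times the chance that LB picks the backend that has moved away.
    p_dead_backend = Fraction(dead_backends, total_backends)
    return p_stale_lb * p_dead_backend


rate = failure_fraction(1, 5, 1, 5)
print(rate)  # 1/25
```

With uneven traffic the observed rate drifts a bit, which would fit the "1 in 20 to 1 in 25" range. The low rate is also why this kind of failure is so hard to spot: 96% of requests still succeed.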

1 comment

Had something similar when a k8s node broke but k8s thought the pods (envoy) on it were still running, so it routed 1/nth of the traffic into a black hole.