| HN Mirror



	by kklimonda 1695 days ago
	How do you keep ecmp hashing stable between rollouts?

3 comments

dharmab 1695 days ago

If you're asking about connection stability in general:

- Ideally, you avoid it in your application design.

- If you need it, you set up SIGTERM handling in the application to wait for all connections to close before the process exits. You also set up "connection draining" at the load balancer to keep existing sessions to terminating Pods open but send new sessions to the new Pods. The tradeoff is that rollouts take much longer; if the session time is unbounded, you may need to enforce a deadline to break connections eventually.
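
A minimal sketch of the application side of this, assuming a Python server that can count its own open connections. `ConnectionTracker`, `handle_sigterm`, `shutdown`, and the deadline value are illustrative names, not from any particular framework, and the load balancer's connection draining is configured separately and not shown.

```python
import signal
import threading


class ConnectionTracker:
    """Counts open connections so shutdown can wait for them to finish."""

    def __init__(self):
        self._cond = threading.Condition()
        self._open = 0

    def opened(self):
        with self._cond:
            self._open += 1

    def closed(self):
        with self._cond:
            self._open -= 1
            if self._open == 0:
                self._cond.notify_all()

    def wait_drained(self, deadline_seconds):
        """Block until no connections remain or the deadline passes.

        Returns True if everything drained, False if the deadline hit first.
        """
        with self._cond:
            return self._cond.wait_for(
                lambda: self._open == 0, timeout=deadline_seconds
            )


shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    # Only flip a flag here; the accept loop checks it and stops taking
    # new connections, then the main thread calls shutdown().
    shutting_down.set()


def install_handler():
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, handle_sigterm)


def shutdown(tracker, deadline_seconds):
    """Wait for in-flight connections, giving up once the deadline passes."""
    if tracker.wait_drained(deadline_seconds):
        return "drained"
    return "deadline exceeded, forcing close"
```

In Kubernetes the deadline here would have to fit inside the Pod's termination grace period, otherwise the process gets killed before the wait finishes.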


dilyevsky 1695 days ago

You don’t just wait until all connections exit; you first need to withdraw the BGP announcement to the edge router, then start the wait. It’s not that simple with metal LBs. On the other hand, it’s not that simple with cloud LBs either, because they also break long TCP streams when they please.


dharmab 1694 days ago

We reused the LB as much as possible to avoid the BGP thing. There's a thing called MetalLB designed around that though.

https://metallb.universe.tf/


dilyevsky 1693 days ago

Pretty sure MetalLB will have the same problem when you need to rotate nodes in BGP mode.


q3k 1695 days ago

You don’t :).

To do it properly you want a Maglev-style layer: a minimal-disruption hash plus draining support, so application servers can be withdrawn or drained without disturbing most flows. This lets you first drain the given application server (keep serving existing connections, but send new ones to the secondaries for that shard) before fully taking down the instance.
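
For reference, a rough sketch of the lookup-table population step from the Maglev paper, which is where the minimal-disruption property comes from. The backend names, the table size of 251, and the use of SHA-256 are assumptions made for illustration; a real load balancer would use a faster hash and a much larger prime table. Draining is modelled here simply by rebuilding the table without the drained backend for new flows; keeping existing connections alive needs connection tracking on top, which this sketch leaves out.

```python
import hashlib


def _hash(key, salt):
    """Stable 64-bit hash; SHA-256 is a stand-in for whatever a real LB uses."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_table(backends, size=251):
    """Fill a Maglev-style lookup table.

    size must be prime so that every backend's preference list
    (offset + j * skip mod size) visits every slot.
    """
    if not backends:
        return []
    backends = sorted(backends)  # same result regardless of input order
    offsets = [_hash(b, "offset") % size for b in backends]
    skips = [_hash(b, "skip") % (size - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table = [None] * size
    filled = 0
    while True:
        # Backends take turns claiming their next preferred free slot.
        for i, backend in enumerate(backends):
            slot = (offsets[i] + next_idx[i] * skips[i]) % size
            while table[slot] is not None:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % size
            table[slot] = backend
            next_idx[i] += 1
            filled += 1
            if filled == size:
                return table


def lookup(table, flow):
    """Map a flow identifier (e.g. a 5-tuple string) to a backend."""
    return table[_hash(flow, "flow") % len(table)]


def changed_slots(old, new):
    """Number of table slots whose owner differs between two tables."""
    return sum(1 for a, b in zip(old, new) if a != b)


backends = ["app-0", "app-1", "app-2", "app-3", "app-4"]
full = build_table(backends)
# Draining app-2: new flows are looked up in a table built without it.
without_app2 = build_table([b for b in backends if b != "app-2"])
print(changed_slots(full, without_app2), "of", len(full), "slots changed")
```

Every slot that belonged to the drained backend has to change; the point of the scheme is that few of the others do, so most flows keep hashing to the same server.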


rhizome 1695 days ago

Sounds like Apache's graceful-restart.


transitorykris 1695 days ago

Sort of. Processes on the same node (graceful restart) vs processes on different nodes (maglev).


rhizome 1692 days ago

Eh, a signal is a signal even if it's an RPC, but my point was to focus on the "waiting for something to end or empty before restarting" part.


bogomipz 1695 days ago

ECMP hashing would be between the edge router and the IPs of the LBs advertising VIPs, no? The LB would maintain the mappings between the VIPs and the nodePort IPs of worker nodes that have a local service Endpoint for the requested service. I don't think this would be any different than it is without Kubernetes, or am I completely misunderstanding your question?


kklimonda 1695 days ago

q3k has mentioned MetalLB+BGP, which is basically an in-cluster implementation of the LoadBalancer Service type (BGP speakers run on k8s nodes and announce /32 routes to nodes based on configuration), but it doesn't provide an answer for "stabilizing" ECMP connections when there are changes to backends. There has to be something "behind" MetalLB[1] that will not only handle stable hashing for connections, but also keep forwarding "in-flight" flows (like established TCP sessions) to the correct backends, even if packets arrive on different ingress nodes. It seems Cilium has some solution for that[2] (by both bundling MetalLB and having a Maglev-based load balancer implementation), but I haven't had time to dig into it, so I was curious if someone else has solved it and would be willing to share stories from the front. This is one of those rough edges around Kubernetes deployments in bare metal environments and I'd love to see what can be done to make it more robust.

[1] metallb only really announces IPs so that "behind" is probably just CNI that actually handles traffic

[2] https://cilium.io/blog/2020/11/10/cilium-19#maglev
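
To make the "something behind" requirement concrete, here is a toy model of two ingress nodes that each keep a local connection table and pick backends for new flows with a consistent hash. It uses rendezvous (highest-random-weight) hashing as a simpler stand-in for Maglev, and `IngressNode`, the pod names, and the flow tuples are all made up for illustration; it is not how Cilium or any other project actually implements this.

```python
import hashlib


def _score(flow, backend):
    """Deterministic per-(flow, backend) weight, identical on every node."""
    digest = hashlib.sha256(f"{flow}|{backend}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def pick_backend(flow, backends):
    """Rendezvous hashing: the backend with the highest score wins."""
    return max(backends, key=lambda b: _score(flow, b))


class IngressNode:
    """One load-balancing node with its own local connection table."""

    def __init__(self, backends):
        self.active = set(backends)  # eligible for new flows
        self.draining = set()        # serve established flows only
        self.conntrack = {}          # flow 5-tuple -> backend

    def drain(self, backend):
        self.active.discard(backend)
        self.draining.add(backend)

    def remove(self, backend):
        """Backend is gone: forget it and any flows pinned to it."""
        self.active.discard(backend)
        self.draining.discard(backend)
        self.conntrack = {
            f: b for f, b in self.conntrack.items() if b != backend
        }

    def route(self, flow):
        pinned = self.conntrack.get(flow)
        if pinned in self.active or pinned in self.draining:
            return pinned  # established flow stays where it is
        chosen = pick_backend(flow, sorted(self.active))
        self.conntrack[flow] = chosen
        return chosen


backends = ["pod-a", "pod-b", "pod-c"]
node1, node2 = IngressNode(backends), IngressNode(backends)
flow = ("10.0.0.7", 40000, "192.0.2.10", 443, "tcp")
# Both nodes agree on the backend without sharing any state.
print(node1.route(flow) == node2.route(flow))
```

The local table covers flows that keep arriving on the same ingress node while a backend drains; the consistent hash covers the case where ECMP moves a flow to a node that has never seen it, since that node computes the same answer unless the flow's backend is the one that went away.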


bogomipz 1695 days ago

Ah OK, I missed that this was MetalLB specific. Interesting that Cilium is using Google's Maglev, which among other things handles the issue of ECMP churn when nodes are taken out of service. I remember reading this in the white paper when it came out. I believe Facebook's Katran does something similar. Thanks for the link.
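
For anyone wondering what "ECMP churn" looks like in numbers, here is a small sketch that assumes the router picks a next hop with a plain hash-modulo-N over the flow tuple. That is an assumption for illustration only: real routers differ, and some offer resilient hashing modes that behave better. The `lb-*` names and the flow tuples are made up.

```python
import hashlib


def flow_hash(flow):
    """Stable hash of a flow tuple (stand-in for the router's hash)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def ecmp_next_hop(flow, next_hops):
    """Naive ECMP: hash modulo the number of equal-cost next hops."""
    return next_hops[flow_hash(flow) % len(next_hops)]


def moved_fraction(flows, before, after):
    """Fraction of flows whose next hop differs between two next-hop sets."""
    moved = sum(
        1 for f in flows
        if ecmp_next_hop(f, before) != ecmp_next_hop(f, after)
    )
    return moved / len(flows)


flows = [("10.0.%d.%d" % (i // 250, i % 250), 30000 + i, "192.0.2.10", 443)
         for i in range(2000)]
before = ["lb-1", "lb-2", "lb-3", "lb-4"]
after = ["lb-1", "lb-2", "lb-3"]  # lb-4 taken out of service
print(moved_fraction(flows, before, after))
```

With a uniform hash, going from four next hops to three this way is expected to move about three quarters of all flows, even though only the quarter that was on the removed node strictly had to move. That gap is what a Maglev-style layer behind the router is meant to absorb.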
