by dijit 1658 days ago
Honestly after I learned that the majority of Kubernetes nodes just proxy traffic between each other using iptables and that a load balancer can't tell the nodes apart (ones where your app lives vs ones that will proxy connection to your app) I got really worried about any kind of persistent connection in k8s land.

Since some number of persistent connections will get force terminated on scale down or node replacement events...

Cilium and eBPF look like a pretty good solution to this though, since you can then advertise your pods directly on the network and load balance those instead of every node.

4 comments

> Honestly after I learned that the majority of Kubernetes nodes just proxy traffic between each other using iptables and that a load balancer can't tell the nodes apart (ones where your app lives vs ones that will proxy connection to your app) I got really worried about any kind of persistent connection in k8s land.

There can be a difference, if your LoadBalancer-type service integration is well implemented. The externalTrafficPolicy knob determines whether all nodes should attract traffic from outside or only nodes that contain pods backing this service. For example, metallb (which attracts traffic by /32 BGP announcements to given external peers) will do this correctly.
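
To make the externalTrafficPolicy distinction concrete, here's a rough Python model of which nodes would attract outside traffic under each setting. It's only an illustration, not Kubernetes or metallb code; the function name, node names and pod names are made up.

```python
def nodes_attracting_traffic(policy, all_nodes, pods_by_node):
    """Return the nodes that should receive external traffic for a service.

    policy: "Cluster" (every node attracts traffic and may proxy it onward)
            or "Local" (only nodes hosting a backing pod attract traffic).
    pods_by_node: hypothetical mapping of node name -> list of backing pods.
    """
    if policy == "Cluster":
        return sorted(all_nodes)
    if policy == "Local":
        return sorted(n for n in all_nodes if pods_by_node.get(n))
    raise ValueError(f"unknown externalTrafficPolicy: {policy!r}")


# Hypothetical three-node cluster; node-b runs no pod for this service.
nodes = ["node-a", "node-b", "node-c"]
pods = {"node-a": ["web-1"], "node-c": ["web-2", "web-3"]}

print(nodes_attracting_traffic("Cluster", nodes, pods))
print(nodes_attracting_traffic("Local", nodes, pods))
```

With "Local", node-b drops out of the set, which is what lets a load balancer (or a BGP peer) tell the two kinds of node apart.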

Within the cluster itself, only nodes which have pods backing a given service will be part of the iptables/ipvs/... Pod->Service->Pod mesh, so you won't end up with scenic routes anyway. Same for Pod->Pod networking, as these addresses are already clustered by host node.

How do you keep ecmp hashing stable between rollouts?
If you're asking about connection stability in general:

- Ideally, you avoid it in your application design.

- If you need it, you set up SIGTERM handling in the application to wait for all connections to close before the process exits. You also set up "connection draining" at the load balancer to keep existing sessions to terminating Pods open but send new sessions to the new Pods. The tradeoff is that rollouts take much longer; if the session time is unbounded, you may need to enforce a deadline to break connections eventually.
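
The SIGTERM-plus-deadline idea from the second bullet could look roughly like this in Python. It's a minimal sketch: the `Drainer` class and its method names are invented for illustration, and the server is assumed to call `connection_opened`/`connection_closed` itself and to stop accepting new work once `terminating` is set.

```python
import signal
import threading
import time


class Drainer:
    """Tracks open connections and blocks shutdown until they finish."""

    def __init__(self, deadline_seconds):
        self.deadline_seconds = deadline_seconds
        self.active = 0
        self.terminating = False
        self._cond = threading.Condition()

    def connection_opened(self):
        with self._cond:
            self.active += 1

    def connection_closed(self):
        with self._cond:
            self.active -= 1
            self._cond.notify_all()

    def handle_sigterm(self, signum=None, frame=None):
        # Only flag shutdown here; the accept loop checks this flag.
        self.terminating = True

    def wait_for_drain(self):
        """Return True if all connections closed, False if the deadline hit."""
        end = time.monotonic() + self.deadline_seconds
        with self._cond:
            while self.active > 0:
                remaining = end - time.monotonic()
                if remaining <= 0:
                    return False  # deadline reached, caller breaks connections
                self._cond.wait(timeout=remaining)
            return True


if __name__ == "__main__":
    drainer = Drainer(deadline_seconds=30)
    signal.signal(signal.SIGTERM, drainer.handle_sigterm)
```

The deadline is what bounds rollout time. In Kubernetes it would presumably need to stay below the Pod's termination grace period, otherwise the process gets killed before it finishes draining.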

You don’t just wait until all connections exit; you first need to withdraw the BGP announcement to the edge router, then start the wait. It’s not that simple with metal LBs. On the other hand, it’s not that simple with cloud LBs either, because they also break long TCP streams when they please.
We reused the LB as much as possible to avoid the BGP thing. There's a thing called MetalLB designed around that though.

https://metallb.universe.tf/

Pretty sure metallb will have the same problem when you need to rotate nodes in BGP mode.
You don’t :).

To do it properly you want a maglev-style layer that allows for withdrawals/drains of application servers with minimal disruption, thanks to a maglev-style hash and draining support. This will allow you to first drain the given application server (continue maintaining existing connections, but send new ones to secondaries for that shard) before fully taking down the instance.
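
A toy version of that idea, for anyone who wants to poke at it: the table population below follows the general scheme described in the Maglev paper (per-backend offset/skip permutations over a prime-sized table), and draining is modelled with a simple flow table. All names are hypothetical, and this is a sketch rather than how Maglev, Katran or Cilium actually implement it.

```python
import hashlib


def _h(value, seed):
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_table(backends, size=251):
    """Populate a Maglev-style lookup table; size should be prime."""
    backends = sorted(backends)
    if not backends:
        raise ValueError("need at least one backend")
    offsets = [_h(b, "offset") % size for b in backends]
    skips = [_h(b, "skip") % (size - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table = [None] * size
    filled = 0
    while True:
        for i, backend in enumerate(backends):
            # Walk this backend's preference list to its next free slot.
            slot = (offsets[i] + next_idx[i] * skips[i]) % size
            while table[slot] is not None:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % size
            table[slot] = backend
            next_idx[i] += 1
            filled += 1
            if filled == size:
                return table


class Balancer:
    def __init__(self, backends, size=251):
        self.size = size
        self.live = set(backends)
        self.table = build_table(self.live, size)
        self.flows = {}  # flow id -> backend, for established connections

    def drain(self, backend):
        """Stop sending new flows to backend; existing flows stay pinned."""
        self.live.discard(backend)
        self.table = build_table(self.live, self.size)

    def pick(self, flow_id):
        if flow_id not in self.flows:
            slot = _h(flow_id, "flow") % self.size
            self.flows[flow_id] = self.table[slot]
        return self.flows[flow_id]


lb = Balancer(["app-1", "app-2", "app-3"])
first = lb.pick("10.0.0.7:51234")
lb.drain("app-2")
print(first == lb.pick("10.0.0.7:51234"))  # the established flow stays put
```

Note the flow table here lives in one process. The hard part in a real deployment is getting the same answer when packets for an established flow arrive at a different load balancer instance.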

Sounds like Apache's graceful-restart.
Sort of. Processes on the same node (graceful restart) vs processes on different nodes (maglev).
Eh, a signal is a signal even if it's an RPC, but my point was to focus on the "waiting for something to end or empty before restarting" part.
ECMP hashing would be between the edge router and the IP of the LBs advertising VIPs, no? The LB would maintain the mappings between the VIPs and the nodePort IPs of worker nodes that have a local service Endpoint for the requested service. I don't think this would be any different than it is without Kubernetes, or am I completely misunderstanding your question?
q3k has mentioned metallb+bgp, which is basically an in-cluster implementation of the LoadBalancer Service type (BGP speakers run on k8s nodes and announce /32 routes to nodes based on configuration), but it doesn't provide an answer for "stabilizing" ECMP connections when there are changes to backends. There has to be something "behind" metallb[1] that will not only handle stable hashing for connections, but also keep forwarding "in-flight" flows (like established TCP sessions) to the correct backends, even if packets arrive on different ingress nodes. It seems cilium has some solution for that[2] (by both bundling metallb and having a maglev-based load balancer implementation), but I haven't had time to dig into it, so I was curious if someone else has solved it and would be willing to share stories from the front. This is one of those rough edges around kubernetes deployments in bare metal environments, and I'd love to see what can be done to make it more robust.

[1] metallb only really announces IPs so that "behind" is probably just CNI that actually handles traffic

[2] https://cilium.io/blog/2020/11/10/cilium-19#maglev
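
To illustrate why changes to the set of ECMP next hops are a problem, here's a small Python sketch of naive hash-modulo-N path selection. This is an assumption about how a simple router might pick a next hop (real hardware varies and some platforms offer resilient hashing); the addresses and node names are invented.

```python
import hashlib


def flow_hash(flow):
    data = "|".join(str(x) for x in flow).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def ecmp_next_hop(flow, next_hops):
    """Naive ECMP: hash the 5-tuple, pick a next hop by modulo."""
    return next_hops[flow_hash(flow) % len(next_hops)]


def remapped_fraction(flows, before, after):
    """Fraction of flows that change next hop even though the hop
    they were using is still present after the change."""
    moved = 0
    survivors = 0
    for flow in flows:
        old = ecmp_next_hop(flow, before)
        if old in after:
            survivors += 1
            if ecmp_next_hop(flow, after) != old:
                moved += 1
    return moved / survivors if survivors else 0.0


# Hypothetical flows: (src ip, src port, dst ip, dst port, protocol).
flows = [(f"10.0.{i // 250}.{i % 250 + 1}", 40000 + i, "192.0.2.10", 443, "tcp")
         for i in range(1000)]
hops = ["node-1", "node-2", "node-3", "node-4"]

# Withdraw node-4 and see how many flows on the other nodes get moved anyway.
print(remapped_fraction(flows, hops, hops[:3]))
```

Flows that were happily landing on node-1 through node-3 can get rehashed to a different node when node-4 is withdrawn, so whatever sits behind the announcement has to forward them back to the right backend.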

Ah OK, I missed that this was MetalLB specific. Interesting that Cilium is using Google's Maglev, which among other things handles the issue of ECMP churn when nodes are taken out of service. I remember reading this in the white paper when it came out. I believe Facebook's Katran does something similar. Thanks for the link.
That's if you're using a NodePort service, which the documentation explains is for niche use cases such as if you don't have a compatible dedicated load balancer. In most professional setups you do have such a load balancer and can use other types of routing that avoid this.

https://kubernetes.io/docs/concepts/services-networking/serv...

> In most professional setups you do have such a load balancer

May I ask what one might use in an AWS cloud environment to provide that load balancer within a Region?

Does IPv6 address any of these issues? It seems to me that IPv6 is capable of providing every component in the system its own globally routable address, identity (mTLS perhaps) and transparent encryption with no extra sidecars, eBPF pieces, etc.

Ingresses on EKS will set up an ALB that sends traffic directly to pods instead of nodes (basically skips the whole K8s Service/NodePort networking setup). You have to use `alb.ingress.kubernetes.io/target-type: ip` as an annotation I think (see https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress...).
> May I ask what one might use in an AWS cloud environment to provide that load balancer within a Region?

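The AWS cloud controller will automatically set up an ELB for you (a Classic Load Balancer by default, or an NLB if you annotate the Service) if you configure a LoadBalancer service in Kubernetes; ALBs are what you get via Ingress. I've also done custom setups with AWS NLBs.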

> Does IPv6 address any of these issues?

It could address some issues: you could conceivably create a CNI plugin which allocates an externally addressable IP to your Pods. You would probably still want a load balancer, though, for custom routing rules and the improved reliability over DNS round robin.

Are ALB/NLB employed to handle traffic between pods in the same cluster? Or have I misunderstood the whole discussion?

My take on the 'eBPF will help solve service mesh' proposal is that it deals with not only ingress/egress traffic (where ALB/NLB are typically employed) but all traffic, including traffic between pods in a cluster. This is where my interests lie.

> Are ALB/NLB employed to handle traffic between pods in the same cluster?

You can choose to do so, or you can communicate directly via the built-in Kubernetes service discovery and CNI overlay network. There are use cases for both.

Whether a load balancer can or cannot tell the nodes apart depends on the load balancer and the method you use to expose your service to it, as well as what kind of networking setup you use (i.e. is pod networking sensibly exposed to the load balancer or ... weirdly).

Each "Service" object provides (by default, can be disabled) a load-balanced IP address that by default uses kube-proxy as you described, a DNS A record pointing to said address, DNS SRV records pointing to actual direct connections (whether NodePorts or PodIP/port combinations), plus API access to get the same data out.
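
For reference, the record names follow a predictable pattern. A small Python helper that builds them is sketched below; the function name is made up, `cluster.local` is only the common default cluster domain, and SRV names assume the Service's ports are named.

```python
def service_dns_names(service, namespace, ports, cluster_domain="cluster.local"):
    """Build the DNS names Kubernetes publishes for a Service.

    ports: list of (port_name, protocol) pairs, e.g. [("http", "TCP")].
    """
    base = f"{service}.{namespace}.svc.{cluster_domain}"
    srv = [f"_{name}._{proto.lower()}.{base}" for name, proto in ports]
    return {"a": base, "srv": srv}


# Hypothetical Service "web" in namespace "prod" with one named port.
names = service_dns_names("web", "prod", [("http", "TCP")])
print(names["a"])
print(names["srv"][0])
```

A client that resolves the SRV name gets port information as well as targets, which is one way to connect without going through the kube-proxy virtual IP.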

There are even replacement kube-proxy implementations that route everything through F5 load balancer boxes, but they are less known.

This is a concern only if you have ungraceful node termination, i.e. you suddenly yoink the node. In most cases when you terminate the node, k8s will (attempt to) cordon and drain the node, letting the pods gracefully terminate their connections before getting evicted.

If you didn’t have k8s and just used an autoscaling group of VMs you would have the same issue…