by AaronBBrown 1543 days ago
This is a design flaw in Kubernetes. The article doesn't really explain what's happening though. The real problem is that there is no synchronization between the ingress controller (which manages the ingress software configuration, e.g. nginx from the Endpoints resources), kube-proxy (which manages iptables rules from the Endpoints resource), and kubelet (which sends the signals to the container). A preStop hook w/ a sleep equivalent to an acceptable timeout will handle 99%+ of cases (and the cases it doesn't will have exceeded your timeout anyhow). Things become more complicated when there are sidecar containers (say an envoy or nginx routing to another container in the same pod), and that often requires shenanigans such as shared emptyDir{} volumes and a process that waits (with fsnotify or similar) for socket files to be closed to ensure requests are fully completed.

It's more of a design compromise than an outright flaw though. Since you can't know if your order to shut down a pod has arrived or not in a distributed system (per the CAP theorem), you either have to do it the way k8s has already implemented it or you have to accept potentially unbounded pod shutdown (and by extension new release rollout) durations during network partitions. K8s just chose Availability over Consistency in this case.

You can argue whether it would not have been preferable to choose C over A instead (or even better, to make this configurable), but in a distributed system you will always have to trade one of these two off. The hacks with shared emptyDir volumes just move the system back to "Consistency" mode, but in a hacky way.

The most obvious design flaw of kubernetes is that the ingress-controller is pluggable and therefore not thoroughly defined.
I would say that's true for networking.k8s.io/v1beta1 Ingress, but not for networking.k8s.io/v1 which is much better.

There are still some issues around "concerns", e.g.:

Should the Ingress also handle redirecting? ALB Ingress has its own annotations DSL to support this, and nginx has a completely different annotations DSL for it. I don't think Envoy does, though.

But then there's the question of supporting CDNs; some controllers support it with annotations and some through `pathType: ImplementationSpecific` and a `backend.resource` CRD (which doesn't have to be a CRD; they could become native networking.k8s.io/v1 extensions in the future that the controllers can opt in to support). This becomes great when combined with the operator framework (+ embedded kubebuilder).

So, I think there's a lot of potential for things to get better.

A great success example in the ecosystem is cert-manager, which a lot of controllers rely on as a peer dependency in the cluster.

> A preStop hook w/ a sleep equivalent to an acceptable timeout will handle the 99%+ cases

That's precisely what we did at one of my previous clients. To increase portability, we wrote the smallest possible sleep equivalent in C, statically linked it, stuck it into a ConfigMap and mounted it into the pods so every workload would have the same pre-stop hook.

It was funny to watch when a new starter in the team would find out about that very elegant, stable and useful hack and go "wtf is going on here?" :D

This dealt with pretty much all our 5XXs due to unclean shutdowns.

I mean, technically, you can recreate this scenario on a single host as well. Send a sigterm to an application and try to swap in another instance of it.

System fundamentals are at the heart of that problem: SIGTERM is just what it is, it's a signal and an application can choose to acknowledge it and do something or catch it and ignore it. The system also has no way of knowing what the application chose to do.
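
To make that concrete, here is a minimal sketch (POSIX only, names such as `on_sigterm` and `send_sigterm_to_self` are made up for illustration) of a process that first catches SIGTERM and merely takes a note, then ignores it outright. In both cases the sender's `os.kill()` call succeeds and tells it nothing about what the receiver did.

```python
import os
import signal
import time

received = []  # signal numbers our handler has seen


def on_sigterm(signum, frame):
    # The application decides what SIGTERM means; here it only takes a note.
    received.append(signum)


def send_sigterm_to_self():
    # Stand-in for a supervisor (kubelet, init system, ...) signalling us.
    os.kill(os.getpid(), signal.SIGTERM)
    time.sleep(0.05)  # give the interpreter a chance to run any handler


# Case 1: catch the signal and react in our own time.
signal.signal(signal.SIGTERM, on_sigterm)
send_sigterm_to_self()
caught = list(received)

# Case 2: ignore it entirely. The sender cannot tell the difference.
signal.signal(signal.SIGTERM, signal.SIG_IGN)
send_sigterm_to_self()
ignored = list(received)

print("after handled SIGTERM:", caught)
print("after ignored SIGTERM:", ignored, "(process still alive)")
```

The only feedback the sender ever gets is whether the process eventually exits, which is why a grace period followed by SIGKILL is the usual fallback.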

All that to say, I'm not sure it's so much a flaw in Kubernetes as it is the way systems work, and Kubernetes is reflecting that.

In my view it is a clear flaw that the signal to terminate can arrive while the server is still getting new requests. Being able to steer traffic based on your knowledge of the state of the system is one of the reasons why you'd want to set up an integrated environment where the load-balancer and servers are controlled from the same process.

The time to send the signal is entirely under control of the managing process. It could synchronize with the load-balancer before sending pods the term signal, and I'm unclear why this isn't done.

I don't think there is anything reasonable to synchronize with that will guarantee no new connections. You can remove the address from the control plane synchronously, but the stale config might live on in the kubelet or kube-proxy distributed throughout the cluster. I don't think you want to have blocking synchronization with every node every time you want to stop a pod.

The alternative is that you wait some amount of time before dying instead of explicit synchronization, which is exactly what this lame-duck period is. You find out that you should die ASAP, and then you decide how long you want to wait until you actually die.
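
A rough sketch of such a lame-duck period, under the assumption that the app exposes some readiness check and can keep serving stragglers for a while. The `LameDuck` class and its method names are hypothetical, not any Kubernetes or library API; `begin_shutdown()` is what a SIGTERM handler would call.

```python
import threading


class LameDuck:
    """Fails readiness immediately, but only stops serving after a delay."""

    def __init__(self, lame_duck_seconds):
        self.lame_duck_seconds = lame_duck_seconds
        self._terminating = threading.Event()
        self._done = threading.Event()

    def ready(self):
        # What a readiness check would report: not ready once told to die.
        return not self._terminating.is_set()

    def accepting(self):
        # Requests routed by stale config are still served until we exit.
        return not self._done.is_set()

    def begin_shutdown(self):
        # Start the lame-duck timer once; repeated calls are no-ops.
        if self._terminating.is_set():
            return
        self._terminating.set()
        timer = threading.Timer(self.lame_duck_seconds, self._done.set)
        timer.daemon = True
        timer.start()

    def wait_until_done(self, timeout=None):
        # Returns True once the lame-duck period has elapsed.
        return self._done.wait(timeout)


duck = LameDuck(lame_duck_seconds=0.2)
duck.begin_shutdown()
print("ready:", duck.ready(), "accepting:", duck.accepting())
duck.wait_until_done()
print("ready:", duck.ready(), "accepting:", duck.accepting())
```

How long to wait is a guess about how quickly every proxy and load balancer picks up the endpoint removal; nothing here confirms that they actually have.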

I don't really see an issue with adding synchronisation; there's no fundamental reason why having endpoint consumers acknowledge updates before terminating removed pods would be horrifically expensive. Especially with endpoint slices.
With 10,000 nodes running kube-proxy it is a bit expensive and, more importantly, error-prone. A problem on a single node that wasn't even talking to the app could stop that app from exiting indefinitely if acks were required, and clusters this size already do gigabits of traffic in endpoints watches.
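
As a toy model of that failure mode (purely illustrative: Kubernetes has no such ack protocol, and `termination_delay` and the latency numbers are invented), compare waiting for every node's ack with capping the wait at a timeout:

```python
import math


def termination_delay(ack_latencies, timeout=None):
    """Seconds until a pod may be stopped.

    ack_latencies: per-node seconds until that node acknowledges the
                   endpoint removal; None means the node never answers.
    timeout:       optional cap on the wait; None means wait for every ack.
    """
    slowest = max(math.inf if s is None else s for s in ack_latencies)
    return slowest if timeout is None else min(slowest, timeout)


# 10,000 healthy nodes that each ack within 50 ms, plus one partitioned node.
healthy = [0.05] * 10_000
with_stuck_node = healthy + [None]

print(termination_delay(healthy))
print(termination_delay(with_stuck_node))
print(termination_delay(with_stuck_node, timeout=30.0))
```

Once a timeout is added to bound the wait, the scheme degrades to the existing wait-then-die behaviour whenever any node is slow or unreachable.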

Additionally, there's no acks possible for clients of headless services, so just kube-proxy handling this doesn't go far enough.

But yeah, maybe accept that as a tradeoff for clusterip services, but more deeply integrate the real load balancer options.

And then many throw a service mesh on top of that foundation.
Why do people continue using k8s if it's so badly designed?
Its design is good enough. There's just enough protocol to make it portable, and it's almost completely extensible so you can make it do basically anything.
Because it's a good cash cow for expensive consultants