by ninkendo 3049 days ago
Seems like the kind of thing that a Deployment should be able to manage on its own... some kind of DrainPolicy object maybe?

Also, if the previous ReplicaSet a Deployment is rolling past has several pods, maybe only some of them need to stay alive (maybe some drain sooner than others.)

Perhaps the whole endeavor should just be to make Pod drainage a bit more explicit than just terminationGracePeriodSeconds... perhaps letting a pod signal with a positive confirmation that it's shutting down (letting connections drain) and the rest of the k8s controllers can just leave it alone until it terminates itself.

Although really, I think a combination of two things is enough: setting terminationGracePeriodSeconds very high (effectively unlimited), and having a health check that ensures the pod doesn't get wedged and miss the termination signal (by checking that a pod status of "shutting down" corresponds to some property of the container, like a health endpoint saying the shutdown is in progress...). Then nothing else needs to be done. Basically, color me skeptical when they say:

"We used service-loadbalancer to stick sessions to backends and we turned up the terminationGracePeriodSeconds to several hours. This appeared to work at first, but it turned out that we lost a lot of connections before the client closed the connection. We decided that we were probably relying on behavior that wasn’t guaranteed anyways, so we scrapped this plan."

(This also depends on the container obeying the standard SIGTERM contract to properly drain connections but not accept new ones, which is pretty standard in most web servers nowadays.)
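
To make the SIGTERM contract above concrete, here is a minimal Python sketch of the bookkeeping a server might do: on SIGTERM it stops admitting new connections, lets existing ones finish, and exposes a health status that says a drain is in progress. The names (DrainState, try_accept, release, health) are made up for illustration and don't come from any particular web server.

```python
import signal
import threading


class DrainState:
    """Hypothetical helper: tracks in-flight connections and drain status."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.draining = False          # flipped once by the SIGTERM handler
        self.idle = threading.Event()  # set whenever no connection is open
        self.idle.set()

    def try_accept(self):
        """Return True if a new connection may start, False once draining."""
        with self._lock:
            if self.draining:
                return False
            self._active += 1
            self.idle.clear()
            return True

    def release(self):
        """Mark one previously accepted connection as finished."""
        with self._lock:
            self._active -= 1
            if self._active == 0:
                self.idle.set()

    def begin_drain(self, signum=None, frame=None):
        """Signal-handler-compatible: only sets a flag, takes no locks."""
        self.draining = True

    def health(self):
        """What a health endpoint could report to an external checker."""
        return "draining" if self.draining else "ok"


def install(state):
    """Register the drain handler for SIGTERM (call from the main thread)."""
    signal.signal(signal.SIGTERM, state.begin_drain)
```

A server loop would call try_accept() before handling each connection and release() after, then block on state.idle.wait() once draining is set and exit on its own when that returns.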


Yeah, I don't know why the terminationGracePeriodSeconds hacks didn't work. It could have been a different, unrelated factor that we didn't discover. It certainly could have been service-loadbalancer/haproxy's fault instead of the termination grace period itself. I'm certainly happy to be proven wrong there.

Not 100% sure about your scenario, but if you set a preStop hook with an exec handler you can arbitrarily delay shutdown within the grace period, because the kubelet won’t terminate the container until preStop returns.

So if you set a 5 hour grace period, and a preStop hook that invokes a script that doesn’t return until all connections are closed (but which tells the container process not to accept new ones) you can control the drain rate.
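
A rough Python sketch of the waiting half of such a preStop script, for illustration only. It assumes you already have some way to count open connections (the count_connections callable here is hypothetical, as are the parameter names), and that the script has separately told the server process to stop accepting new ones.

```python
import time


def wait_for_drain(count_connections, timeout_s, poll_s=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Block until count_connections() reports 0 or timeout_s elapses.

    Returns True if the connections drained, False if the deadline passed.
    count_connections is a hypothetical callable supplied by the caller;
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if count_connections() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

The timeout would be set a bit below the pod's grace period (5 hours in the example above), so the script returns on its own before the grace period runs out.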

There are some app-level smarts required: new connections have to be rejected, and any proxies have to rebalance you. Haproxy does this in most cases, but the service proxy won’t (in iptables mode).

If that’s not the behavior you’re seeing, please open a bug on Kube and assign me (this is something I maintain).

Yeah I think that there is still some potential in the terminationGracePeriod strategy, but we found this other way that worked reliably and stopped exploring that path. If I can repro the issue I'll let you know.

One extra thing I remember that was sort of problematic was that when a pod was Terminating it'd get removed from the Endpoints, so any tooling that was using the API info to keep an eye on connections was basically unusable at that point.