by stephenorr 981 days ago
Still going on this morning.

If you're experiencing difficulties, the fix for us was to create a new node pool, then force kill the individual nodes in the previous pool. That forces k8s to move the workloads to the new pool.
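For anyone wanting to script the node-by-node part, here is a minimal sketch that only builds the `kubectl` commands rather than running them. It is not the exact procedure described above: it assumes the new pool already exists, it assumes DOKS nodes carry a `doks.digitalocean.com/node-pool` label (check with `kubectl get nodes --show-labels`), and it uses cordon plus drain as a gentler first attempt. The helper names are hypothetical. If the drain hangs, destroying the node outright, as described above, may still be necessary.

```python
# Sketch only: builds kubectl command lines for emptying an old node pool.
# Nothing here talks to a cluster; pass the lists to subprocess.run yourself.

# Assumed DOKS node label; verify with `kubectl get nodes --show-labels`.
POOL_LABEL = "doks.digitalocean.com/node-pool"


def list_pool_nodes_cmd(pool_name):
    """Command that lists the nodes in one pool, one `node/<name>` per line."""
    return ["kubectl", "get", "nodes", "-l", f"{POOL_LABEL}={pool_name}", "-o", "name"]


def parse_node_names(kubectl_output):
    """Turn `kubectl get nodes -o name` output into bare node names."""
    names = []
    for line in kubectl_output.splitlines():
        line = line.strip()
        if line:
            names.append(line.split("/")[-1])  # drop the leading "node/"
    return names


def evacuate_node_cmds(node_name, timeout="120s"):
    """Commands to stop scheduling on a node, then evict its pods."""
    return [
        ["kubectl", "cordon", node_name],
        [
            "kubectl", "drain", node_name,
            "--ignore-daemonsets",       # DaemonSet pods cannot be evicted
            "--delete-emptydir-data",    # allow evicting pods using emptyDir
            "--force",                   # also evict pods with no controller
            f"--timeout={timeout}",      # give up instead of hanging forever
        ],
    ]
```

Usage would be something like running `list_pool_nodes_cmd("old-pool")`, feeding its output to `parse_node_names`, and then executing `evacuate_node_cmds(name)` for each node in turn, where `old-pool` is a placeholder pool name.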

You can't rely on the workloads moving themselves, because there's some weirdness around Cilium that's preventing them from being cleaned up properly. I'm not an expert on why, but the operator looks to be having trouble connecting to the Cilium daemon.

I've had to do this at least twice now, and DO Support don't seem to be any closer to a resolution. I've moved to a fixed-size node pool now to see if the cluster autoscaling is part of the problem.

Not knowing the details, I wonder if this incident could be Kubernetes' Waterloo. The supposed benefits of self-healing and management are the only things bolstering its use, as far as I understand.

It turned out that DO's base images for the worker nodes had automatic updates turned on; these were kicking in at around 6am each day, and causing the nodes to fail.

I'm sure their official incident report will have more details, but right now it looks like this has nothing to do with Kubernetes directly, and everything to do with the underlying OS.

DO have now disabled those automatic updates so that this stops happening; it's been stable since then.

Interesting, thanks.