| This is unrelated to StatefulSets, but I'm going to take the opportunity to ask a Kubernetes engineer for help, since the the kubernetes-users Slack channel sort of feels like shouting into a void. We deploy a small cluster (1 master, 6 nodes) at our startup that started misbehaving last week. All of a sudden three of the nodes went down - one became unresponsive and two had the error "container runtime is down." We couldn't ssh into the unresponsive one, but according to AWS the machine was fine, still receiving network requests and using CPU. Since we couldn't diagnose the issue, we spun up an entirely new cluster using kops, but started seeing the exact same behavior later that night, and again over the weekend. Three nodes were in a not ready state, for the same reasons (unresponsive and container runtime is down). Right now our only solution to solve this issue is to manually terminate the EC2 instances and rely on the Auto-Scaling Group to create new ones. In the mean time, Kubernetes tells us that it can't schedule all of our desired pods, so half of our jobs aren't running, obviously an undesirable situation. A handful of questions I have about the situation: Why are these nodes going down? What causes a node to go unresponsive? Why does the container runtime go down on a node and why doesn't it get restarted? Why doesn't Kubernetes destroy these nodes when they've been out of commission for 3-4 hours? Any help would be appreciated!!! I've been looking through half a dozen log files and gotten zero answers. |
IIRC we improved garbage collection settings in the latest kops (1.5.1), so if you were running out of disk, using the latest kops should fix everything. It's also easy to reconfigure to use a bigger root disk if you're churning through containers faster than GC can keep up. But if it's something else we can try to diagnose it as well!
> Why doesn't Kubernetes destroy these nodes when they've been out of commission for 3-4 hours?
We should, I believe. I actually thought we had an issue for this very problem, though I can't find it. I'll open a new one if I can't track it down. There is maybe an argument that we should fix the root cause, but there's an unlimited number of things that can go wrong, so we need to do both.
(edit: Gave up on finding the existing issue and opened https://github.com/kubernetes/kops/issues/2002 )