by PaulJulius 3401 days ago
This is unrelated to StatefulSets, but I'm going to take the opportunity to ask a Kubernetes engineer for help, since the kubernetes-users Slack channel sort of feels like shouting into a void.

We deploy a small cluster (1 master, 6 nodes) at our startup that started misbehaving last week. All of a sudden three of the nodes went down - one became unresponsive and two had the error "container runtime is down." We couldn't ssh into the unresponsive one, but according to AWS the machine was fine, still receiving network requests and using CPU.

Since we couldn't diagnose the issue, we spun up an entirely new cluster using kops, but started seeing the exact same behavior later that night, and again over the weekend. Three nodes were in a not ready state, for the same reasons (unresponsive and container runtime is down). Right now our only fix is to manually terminate the EC2 instances and rely on the Auto-Scaling Group to create new ones. In the meantime, Kubernetes tells us that it can't schedule all of our desired pods, so half of our jobs aren't running, which is obviously undesirable.

A handful of questions I have about the situation: Why are these nodes going down? What causes a node to go unresponsive? Why does the container runtime go down on a node and why doesn't it get restarted? Why doesn't Kubernetes destroy these nodes when they've been out of commission for 3-4 hours?

Any help would be appreciated!!! I've been looking through half a dozen log files and gotten zero answers.

3 comments

So first, sorry about the problem. Please come hang out in the sig-aws or kops channels - we're a bit smaller and more focused than kubernetes-users, and can typically get these problems solved pretty quickly together.

IIRC we improved garbage collection settings in the latest kops (1.5.1), so if you were running out of disk, using the latest kops should fix everything. It's also easy to reconfigure to use a bigger root disk if you're churning through containers faster than GC can keep up. But if it's something else we can try to diagnose it as well!

> Why doesn't Kubernetes destroy these nodes when they've been out of commission for 3-4 hours?

We should, I believe. I actually thought we had an issue for this very problem, though I can't find it. I'll open a new one if I can't track it down. There is maybe an argument that we should fix the root cause, but there's an unlimited number of things that can go wrong, so we need to do both.

(edit: Gave up on finding the existing issue and opened https://github.com/kubernetes/kops/issues/2002 )

I ran into something very similar with a cluster almost identical to yours. It turns out the default disk size for kops is 25G, and when your masters run out of space things start to die with almost no way of telling why.

I rerolled with 100G and I've seen zero problems since.
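
For anyone who wants to catch the disk filling up before things start dying, here is a minimal sketch of a check you could run on each machine. The 80% threshold and the function name disk_report are assumptions made for illustration, not anything kops ships or recommends.

```python
import shutil


def disk_report(path="/", warn_at=0.80):
    """Return (used_fraction, warn) for the filesystem that holds `path`.

    `warn_at` is a hypothetical threshold; pick one that suits your workload.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        # Some pseudo-filesystems report a size of zero; treat them as empty.
        return 0.0, warn_at <= 0.0
    used_fraction = usage.used / usage.total
    return used_fraction, used_fraction >= warn_at


if __name__ == "__main__":
    fraction, warn = disk_report("/")
    print(f"root disk {fraction:.0%} full" + (" (WARNING)" if warn else ""))
```

Run from cron or a systemd timer, this at least leaves a trail showing whether disk was the problem.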

> Why doesn't Kubernetes destroy these nodes when they've been out of commission for 3-4 hours?

Kubernetes isn't responsible for the lifecycle of its nodes. It can run in a DC where "destroying a node" might mean paging a tech to turn off a server. Something external - in your case, kops & your ASG - is responsible for the nodes that Kubernetes runs on. That's a deliberate design choice.

It should make a correct decision not to schedule work there, which it sounds like it did.
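
Since something external has to act on dead nodes, a rough sketch of the detection half might look like this. It assumes the input is the parsed output of `kubectl get nodes -o json`; the one-hour cutoff and the name stale_nodes are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def parse_time(stamp):
    """Parse a Kubernetes-style UTC timestamp such as "2017-03-01T12:00:00Z"."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def stale_nodes(node_list, now, max_age=timedelta(hours=1)):
    """Names of nodes whose Ready condition has not been "True" for > max_age.

    `max_age` is a hypothetical cutoff, not a Kubernetes default.
    """
    stale = []
    for node in node_list.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") != "Ready":
                continue
            # Ready can be "True", "False" or "Unknown"; the last two both
            # mean the scheduler will not place new work on the node.
            if cond.get("status") == "True":
                continue
            not_ready_for = now - parse_time(cond["lastTransitionTime"])
            if not_ready_for > max_age:
                stale.append(node["metadata"]["name"])
    return stale
```

What to do with the resulting names (alert, terminate, replace) is up to whatever owns the machines.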

Given that, your other questions are hard to answer. kubelet is a process that runs on the nodes. So is docker. If you can't get into the machine to diagnose the fault, I'd encourage you to set up some monitoring/log shipping off the node so you can see what the state was when it failed.
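
As a starting point for getting state off the node, something like the sketch below could emit a small JSON snapshot every minute to whatever log shipper you already run. It assumes a Linux node with /proc/meminfo and a kernel new enough to report MemAvailable; the field names in the snapshot are invented for the example.

```python
import json
import os
import shutil
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: kilobytes} dict."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def snapshot(meminfo_text, disk_path="/"):
    """Collect a few numbers that help explain why a node stopped responding."""
    mem = parse_meminfo(meminfo_text)
    disk = shutil.disk_usage(disk_path)
    return {
        "time": time.time(),
        "load_1m": os.getloadavg()[0],  # Unix only
        "mem_available_kb": mem.get("MemAvailable"),
        "disk_free_bytes": disk.free,
    }


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        # One JSON object per line is easy for most log shippers to ingest.
        print(json.dumps(snapshot(handle.read())))
```

The last few lines shipped before a node goes quiet usually narrow things down to memory, disk or load.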

There's nothing inherently "Kubernetes" about this diagnosis - it's more EC2, node/kernel/OS and Docker troubleshooting, in that order.

Correct, Kubernetes is not responsible for the nodes. I would build a health check into your Auto Scaling Group (I don't know exactly how to do this on AWS, but am happy to show you an example on GCP - aronchick (at) google).
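
On the AWS side, one possible approach (a sketch, not a tested recipe) is to mark the instance unhealthy and let the ASG replace it. The code below only builds `aws autoscaling set-instance-health` commands; it assumes node providerIDs have the form aws:///<zone>/<instance-id>, and both function names are hypothetical.

```python
def instance_id_from_provider_id(provider_id):
    """Extract "i-..." from a providerID such as "aws:///us-east-1a/i-0abc".

    The providerID layout is an assumption; check what your nodes report.
    """
    if not provider_id.startswith("aws://"):
        return None
    last = provider_id.rsplit("/", 1)[-1]
    return last if last.startswith("i-") else None


def unhealthy_commands(provider_ids):
    """Build one AWS CLI argv list per instance that should be replaced."""
    commands = []
    for provider_id in provider_ids:
        instance_id = instance_id_from_provider_id(provider_id)
        if instance_id is None:
            continue  # not an EC2-backed node, leave it alone
        commands.append([
            "aws", "autoscaling", "set-instance-health",
            "--instance-id", instance_id,
            "--health-status", "Unhealthy",
        ])
    return commands
```

Each list can be handed to subprocess.run once you trust the selection logic; printing the commands first is a safer way to start.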

If you can't get to the machine, there are a million reasons why this would be the case - but ssh is a totally separate process, it's way outside of Kubernetes. VERY commonly, you've run out of memory and processes are fighting among themselves (especially since EVERYTHING seems to be failing), but this is total speculation. OS issues are common too - I've spun up clusters switching from one distro to another, same config, and everything worked great.

Disclosure: I work at Google on Kubernetes.

Speaking of distros, and considering your background: what would be the "best" distro for running Kubernetes?
If all the OS does is provide a minimal surface for running containers, I'd focus on whatever gives me the best security, manageability and updates.

Container-Optimized OS is what GKE uses on Google Cloud Platform: https://cloud.google.com/container-optimized-os/docs/

It's conceptually very similar to CoreOS' Container Linux, so I might try that if I were looking at Kubernetes elsewhere and wanted a container-only OS.

If I am running an environment with multiple purposes - some container hosts, some regular machines - I'd err on the side of "who is my current vendor/what does my ops team support and know best".

Great, thanks for the valuable info. We are running SLES12 and also a SUSE OpenStack Cloud on bare metal, and only recently SUSE announced their container strategy (SLE MicroOS distro), but we haven't had time to evaluate it yet. At a recent DevConf I saw some interesting talks about immutable container hosts such as Fedora Atomic. It seems that a lot of work is being done in this area.