|
|
|
|
|
by crb
3403 days ago
|
|
> Why doesn't Kubernetes destroy these nodes when they've been out of commission for 3-4 hours? Kubernetes isn't responsible for the lifecycle of its nodes. It can run in a DC where "destroying a node" might mean paging a tech to turn off a server. Something external - in your case, kops & your ASG - is responsible for the nodes that Kubernetes runs on. That's a deliberate design choice. It should make a correct decision not to schedule work there, which it sounds like it did. Given that, your other questions are hard to answer. kubelet is a process that runs on the nodes. So is docker. If you can't get into the machine to diagnose the fault, I'd encourage you to set up some monitoring/log shipping off the node so you can see what the state was when it failed. There's nothing inherently "Kubernetes" about this diagnosis - it's more EC2, node/kernel/OS and Docker troubleshooting, in that order. |
|
If you can't get to the machine, there are a million reasons why this would be the case - but ssh is a totally separate process, it's way outside of Kubernetes. VERY commonly, you've run out of memory and processes are fighting among themselves (especially since EVERYTHING seems to be failing), but this is total speculation. OS issues are common too - I've spun up clusters switching from one distro to another, same config, and everything worked great.
Disclosure: I work at Google on Kubernetes.