by imsky 3274 days ago
We've been running a small Kubernetes cluster of < 30 nodes that handles a variety of workloads using kops for almost a year now. kops is a significant improvement over other provisioning tools like kube-up.sh and kube-aws and has simplified infrastructure management a great deal. We can provision a brand new cluster and a couple dozen services across multiple namespaces in less than an hour - kops helps a lot with making that process smooth and reliable.

We have run into some issues with kops. Customizing the Kubernetes executables, e.g. using a particular Docker version or storage driver, was buggy before 1.5. Upgrading clusters to later Kubernetes versions has left some of the kube-system services, like kube-dns, in a weird state. Occasionally we encounter pods failing to schedule or volumes failing to mount - these are fixed either by restarting the Kubernetes (systemd) services on the problem nodes or by reprovisioning the nodes entirely. On one occasion, a bad kops cluster update left our networking in an unrecoverable state (and our cluster inaccessible).
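
For anyone hitting the same hung-node symptoms, here is a rough sketch of the restart workaround, run on the problem node itself. The unit names `docker` and `kubelet` are assumptions on my part (check `systemctl list-units` on your own nodes), and `restart_node_services`, `UNITS`, and `RUN` are just names made up for this example:

```shell
# Assumed unit names; nodes provisioned differently may use other ones.
UNITS="${UNITS:-docker kubelet}"

# Restart each unit in order, stopping at the first failure.
# RUN defaults to sudo; set RUN=echo to print the commands instead of running them.
restart_node_services() {
  for unit in $UNITS; do
    ${RUN:-sudo} systemctl restart "$unit" || return 1
  done
}
```

Running `RUN=echo restart_node_services` first shows what would be executed. If the node stays unhealthy after a restart, reprovisioning it is the fallback.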

I don't think there are any missing pieces; the initial configuration is what usually takes the most time to set up. You'll have to become familiar with the kops source, as not everything is documented. As for running 30 clusters with a 2-person team, it's definitely feasible, just complicated when you're constantly switching between clusters.


Definitely some great feedback there - I think most of those are known issues, and not all of them are technically kops issues, but we'll be figuring out how to work around them for kops users. (Switching Docker versions is tricky because k8s is tested with a particular version, so we've been reluctant to make that too easy, and the kube-dns 1.5 -> 1.6 upgrade was particularly problematic.) Do file issues for the hung nodes - it might be k8s rather than kops, but one of the benefits of kops is that it takes a lot of unknowns out of the equation, so we can typically reproduce and track things down faster.

And it is way too hard to switch clusters with kubectl, I agree. I tend to use separate kubeconfig files, and use `export KUBECONFIG=<path>`, but I do hope we can find something better!
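
A minimal sketch of that separate-kubeconfig approach as a shell function, in case it helps. It assumes one kubeconfig file per cluster at `$KUBE_CLUSTERS_DIR/<name>.yaml`; that layout, the directory default, and the `usecluster` name are all hypothetical, not anything kops or kubectl provides:

```shell
# Hypothetical layout: one kubeconfig per cluster at $KUBE_CLUSTERS_DIR/<name>.yaml
KUBE_CLUSTERS_DIR="${KUBE_CLUSTERS_DIR:-$HOME/.kube/clusters}"

# usecluster <name>: point kubectl at that cluster's kubeconfig (current shell only)
usecluster() {
  if [ -z "$1" ]; then
    echo "usage: usecluster <name>" >&2
    return 2
  fi
  _uc_file="$KUBE_CLUSTERS_DIR/$1.yaml"
  if [ ! -f "$_uc_file" ]; then
    echo "no kubeconfig at $_uc_file" >&2
    return 1
  fi
  # Same effect as typing: export KUBECONFIG=<path>
  export KUBECONFIG="$_uc_file"
}
```

After sourcing this, `usecluster staging` followed by `kubectl get nodes` would talk to whichever cluster `staging.yaml` describes. Because the variable is only exported in the current shell, separate terminals can point at different clusters.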

Right, the hung nodes issue is probably least related to kops (though it'd be great if in the future, kops could leverage something like node-problem-detector to mitigate similar issues). Of the other issues, the incorrectly applied cluster config (kops decided to update certs for all nodes and messed them up in the process, then proceeded to mess up the Route53 records for the cluster) is the most serious one, and also not likely easy to reproduce. Apart from that, kops has been an excellent tool and we've been very pleased with it.