Hacker News new | ask | show | jobs
by ovidiup 3407 days ago
I built Jollyturns (https://jollyturns.com) which has a fairly large server-side component. Before starting the work on Jollyturns I worked at Google (I left in 2010) on a variety of infrastructure projects, where I got to use first hand Borg and all the other services running on it.

When I started more than 5 years ago few of the options mentioned in the article were available, so I just used Xen running on bare metal. I just wanted to get things done as opposed to forever experimenting with infrastructure. Not to mention the load was inexistent, so everything was easy to manage.

After I launched the mobile app, I decided to spend some time on the infrastructure. 2 years ago I experimented with Marathon, which was originally developed at Twitter, probably by a bunch of former Google employees. The reason is Marathon felt very much like Borg: you could see the jobs you launched in a pretty nice Web interface, including their log files, resource utilization and so on. Deploying it on bare metal machines however was an exercise in frustration since Marathon relied heavily on Apache Mesos. For a former Googler, Mesos had some weird terminology and was truly difficult to understand what the heck was going on. Running Marathon on top of it had major challenges: when long-running services failed you could not figure out why things were not working. So after about 2 weeks I gave up on it, and went back to manually managed Xen VMs.

Around the same time I spent few days with Docker Swarm. If you're familiar with Borg/Kubernetes, Swarm is very different from it since - at least when it started, you had to allocate services to machines by hand. I wrote it off quickly since allocating dockerized services on physical machines was not my idea of cluster management.

Last year I switched to Kubernetes (version 1.2) since it's the closest to what I expect from a cluster management system. The version I've been using in production has a lot of issues: high availability (HA) for its components is almost non-existent. I had to setup Kube in such a way that its components have some resemblance of HA. In the default configuration, Kubernetes installs its control components on a single machine. If that machine fails or is rebooted your entire cluster disappears.

Even with all these issues, Kubernetes solves a lot of problems for you. The flannel networking infrastructure greatly simplifies the deployment of docker containers, since you don't need to worry about routing traffic between your containers.

Even now Kubernetes doesn't do HA:

https://github.com/kubernetes/kubernetes/issues/26852

https://github.com/kubernetes/kubernetes/issues/18174

Don't be fooled by the title of the bug report, the same component used in kubectl to implement HA could be used inside Kube' servers for the same reason. I guess these days the Google engineers working on Kubernetes have no real experience deploying large services on Borg inside Google. Such a pity!

1 comments

FYI, Kubernetes can do HA: the issues you pointed to are both about client connectivity to multiple API servers, which is typically solved today either using a load-balancer, or by using DNS with multiple A records (which actually works surprisingly well with go clients). The issues you pointed to are about a potential third way, where you would configure a client with multiple server names/addresses, and the client would failover between them.
As I said, it's not only kubectl that has the problem. None of the services implemented by Kubernetes are HA: kubelet, proxy, and the scheduler. For a robust deployment you need these replicated.

Using DNS might work for the K8s services, but at least in version 1.2, SkyDNS was an add-on to Kubernetes. This should really be part of the deployed K8s services. Hopefully newer versions fixed that, I didn't check.

Preferably, the base K8s services implement HA natively. Deploying a separate load balancer is just a workaround around the problem.

FYI Google's Borg internal services implement HA natively. Seems to me the Kubernetes team just wanted to build something quick, and never got around to doing the right thing. But I think it's about time they do it.

I think this was true around kubernetes 1.2, but is no longer the case. etcd is natively HA. kube-apiserver is effectively stateless by virtue of storing state in etcd, so you can run multiple copies for HA. kube-scheduler & kube-controller-manager have control loops that assume they are the sole controller, so they use leader-election backed by etcd: for HA you run multiple copies and they fail-over automatically. kubelet & kube-proxy run per-node so the required HA behaviour is simply that they connect to a different apiserver in the event of failure (via load-balancer or DNS, as you prefer).

kube-dns is an application on k8s, so it uses scale-out and k8s services for HA, like applications do. And I agree that it is important, I don't know of any installations that don't include it.

I think the right things have been built. We do need to do a better job documenting this though!

Great, thanks for the update! I'll update my deployment towards the end of spring, hopefully that's not going to be too painful.
etcd itself cannot be horizontally scaled because of the architecture. etcd's leader model cannot allow you to go beyond a certain number of nodes in cluster. The leader would be overloaded.
I think federation allows to scale horizontally above the limitation of a single etcd cluster. OTOH The fact that zk/etcd/consul are all leader-based is probably the reason flynn "simply" uses postgres