Hacker News new | ask | show | jobs
Scaling Kubernetes to 2,500 Nodes (blog.openai.com)
265 points by stanzheng 3072 days ago
9 comments

This is a really fantastic set of general "how to tune kubernetes and the various components for large clusters". Thanks for writing this up!
I'm surprised that the scaling story of k8s/(+etcd?) is still so far behind mesos/zk. There have been mesos clusters at over 10k Nodes for several years now.

I have never personally needed more than a few hundred mesos agents, but these have been added without any noticeable impact on our extremely modestly provisioned (and multi purpose) zk cluster or any other components.

Has anyone used both systems and can speak to any advantages of k8s for these types of workloads?

Also is anyone using some kind of torrent approach as a more reasonable solution to avoid network bottlenecks when distributing big docker images to a large number of nodes?

A lot of the issues were kind of "external" and while worth thinking about for every deployment, not really something the k8s project can do much about other than warn in the documentation.

  - disk latency
  - monitoring queries
  - homemade autoscaler killing all etcd nodes
  - custom scheduling policy moving many kubedns processes to the same node
  - unusually large docker images
  - "sharing" gcr.io request quotas because of Azure NAT IPs
That's not to say that Mesos is not indeed scaling better or easier. I don't know enough about Mesos.
what I find amazing about k8s is that it's one of the first solution that is relativly simple for a small cluster (HA, while schedule stuff on the masters), but can scale amazingly well even for a big cluster. you can start with 3 nodes with like 8gb per machine (or less, I guess even 2gb is feasible if you only want to use like 1-1,5gb of memory per machine). (non ha can of course be smaller)
The Nomad executable is self contained and less than 15MB if I remember correctly. It can be used with Docker containers, shell scripts, or just raw executables.

https://www.hashicorp.com/c1m.html

It was dead simple to install and use compared to my brief experience with k8s.

As a person who doesn’t understand containers, where do I go to learn the basics?
If you want to avoid doing a bunch of research on your own I've put together an up to date self paced video course at https://diveintodocker.com/.

It covers everything from "What is Docker?" to learning how to apply it to your own projects. There's a tiny bit of theory, followed by lots of guided labs and examples.

In case you're curious, I've been using Docker in development and production since 2014 and am also a Docker Captain (TL;DR is Docker reached out to me to join their team as a trusted content provider).

We have Docker 101 [0] and Kubernetes 101 [1] tutorials on our github repo which may be helpful, assuming you have an understanding of Linux. Or checkout the resources on Docker's website [2].

[0] https://github.com/gravitational/workshop/blob/master/docker...

[1] https://github.com/gravitational/workshop/blob/master/k8s101...

[2] https://www.docker.com/what-docker

DISCLAIMER: I'm a Consulting Architect at Red Hat in our Container and PaaS Practice Group. An expert, knowledge building, & community building/supporting group. A Community of Practice, we call it.

We (Red Hat) make the following references (beyond what our training & docs provide) available to our consultants, customers, and world at-large. BTW, if something isn't clear, is wrong, or you want to discuss a point, reach out to us on GitHub. Just about all of our products, software, and documentation are up on GitHub.

I'd also recommend playing with Minishift or Minikube. Great way to put a quick sandbox on your laptop.

The source GitHub Repo: https://github.com/redhat-cop/openshift-playbooks

Building Blocks of OpenShift (& Kubernetes): http://v1.uncontained.io/playbooks/fundamentals/building_blo...

Docker Fundamentals Reference: http://v1.uncontained.io/playbooks/fundamentals/docker_refer...

Minishift: https://docs.openshift.org/latest/minishift/getting-started/...

The defining characteristic of containers are constrained processes (constrained by one or more of memory, cpu, access controls..); isolated namespaces (your own tmp, root, users, process list) plus some (possibly limited) additional isolation from the rest of the hosts files and hardware.

Docker adds to this 1. a packaging that lets you define what goes into a container (Dockerfile) and format (docker image) - which can be downloaded, extracted, manipulated and uploaded. 2. a way to stop (freeze) and start (thaw) a container 3. tools for controlling network capabilities within a container and between the host and container or other containers on other hosts.

Those are the essentials.

Kubernetes (and other tools) expand on this in terms of orchestration -- especially the internetworking aspect but also failover and load balancing.

Docker has a good overview in my opinion:

https://www.docker.com/what-container

At the fundamental layer, start with cgroups and namespaces. If you're looking for practical knowledge, start with making and running a Docker image - usually a basic "hello world" web server (nginx) is a good one to start on :)
Here's a Docker tutorial I wrote targeted at beginners: https://docker-curriculum.com/
Once you understand the basics of Docker, which is containers, then look at Kubernetes. One place you might start is with spinning up a single VM with RancherOS.
350TB of memory, and 50,000 cores, nice.

ARP caching seems to be a common issue in cloud environments. AWS recommends turning it off and does so itself in their Amazon Linux distro.

Ran into the ARP scale issues when trying to put 1000 containers on a system for scale testing over year ago. strace helped figure out where the issues was and what settings to change. I guess I should have sent an email to the mailing list. At that time if you searched for scaling to 1000 docker contains was a failed search, as it was "hey here is how I scaled to 1000 containers over X numbers of nodes". No one was crazy enough to try to get 1000 on a single machine.
Does OpenAI train w/ GPUs on k8s clusters?
According to the article, they are using NC24 VMS, which have 4 K80s attached. So yes, I would assume they are using GPUs.

Check out https://github.com/google/kubeflow if you are interested in doing the same.

(Disclaimer: I work for GCP doing K8s stuff, I know GKE clusters support GPUs and Kubeflow, not 100% sure if AKS supports it or if you need to set up your own cluster like OpenAI did.)

I have a somewhat off-topic question as a complete TensorFlow beginner and it seems like you'd be in the know:

If I want to train a TF model distributed over many machines in GCP, it seems like I could use Cloud ML Engine or deploy Kubeflow to a K8s cluster running in GKE and train it there.

What should I consider when choosing between these two options? Is there another option I should consider?

RiseML provides a higher-level abstraction than Kubeflow that is more similar to Google Cloud ML. I would love to get your feedback on our solution: https://riseml.com

Btw: we are currently preparing an open-source release

Disclaimer: I am co-founder at RiseML

1. Well deploying a k8s cluster is a huge engineering challenge. 2. And doing distributed training in general is an engineering challenge. 3. And combine the two for an even larger engineering challenge of doing distributed training on k8s using GPUs.

ML Engine is an order of magnitude easier. You just have to do step 2 and setup your model to employ multiple GPUs.

Yep, we've been using GPUs for quite a while (even before the alpha support in Kube), both the K80s in Azure and some Pascals in our own clusters. With the support in Kube now it's quite seamless.
That's some groundbreaking work: GPUs in K8s. By Kube you mean the KubeFlow project?
Late reply, but Kube meant Kubernetes not Kubeflow. Alpha GPU support landed in 1.6 if my memory serves me right. Before that you had to do a bunch of stuff manually to make GPU work, mostly around scheduling etc. Since 1.6, Kubernetes will automatically detect the GPUs on your node and thus correctly assign the workloads where they fit. Kubeflow is an abstraction layer on top of that, that helps a lot when you want to do things such as distributed TensorFlow training. It also helps a bit for simpler jobs by (almost) removing the need to manually mount the NVIDIA drivers from the host into the container for example.
Isn’t it a problem to have etcd store its state on a non persistent volume?

How do they recover it after a restart? I suppose it's not a manual process.

The replacement machine will start pulling its data from the remaining nodes when it joins the cluster. However, it's recommended to migrate the failed node's data first if it's greater than 50MB: https://github.com/coreos/etcd/blob/master/Documentation/op-...
(Azure containers lead here) Awesome to see OpenAI scale Kubernetes on Azure!
(in Azure) Are VMs still tied to a specific machine and thus if the machine goes down the VMs need be restarted?
No. I work exclusively with scaling Linux workloads in Azure and local failover happens fairly regularly without the user ever seeing any indication. The hypervisor has its own cache which is tied more to your storage acct & runtime data-disks than the VM itself. This is true even with the lowest cost storage option: LRS standard blob-storage.

Howwwwever, LRS storage does not save you if the whole datacenter goes down, or during scheduled maintenance (when the whole datacenter is down.) For that, you'll need ZRS (which does failover to a co-lo in the event of the primary datacenter going down) or GRS (for which you can configure/test your failover options.)

Also, Microsoft's strength is in their PaaS services, like app-service or Azure-Functions. Those usually have CosmosDB on the backend, which is pretty much the best failover/DB-availability server-software on the market in my opinion.

VMs in all clouds are always tied to specific machines. If that machine fails unexpectedly then those VMs will restart. If it is a controlled reboot (e.g. host update) then they may not restart...
On Google's cloud, at least, VMs don't get killed for planned hypervisor downtime, which is a potential differentiator for them. Of course, this should matter a lot less for workloads that have been explicitly designed run on Kubernetes...
Well, at least in Google Cloud for planned updates you can get your VM host migrated and not lose a Node due to a planned maintenance. I am not aware if Azure supports this, but my guess is they do not.
They (Azure) do for most operations.

https://docs.microsoft.com/en-us/azure/virtual-machines/wind...

I've found actual reboots to be rare - the exception being the recent Spectre / Meltdown patching.