by eggie5 3076 days ago
Does OpenAI train w/ GPUs on k8s clusters?

According to the article, they are using NC24 VMs, which have 4 K80s attached. So yes, I would assume they are using GPUs.

Check out https://github.com/google/kubeflow if you are interested in doing the same.

(Disclaimer: I work for GCP doing K8s stuff. I know GKE clusters support GPUs and Kubeflow, but I'm not 100% sure whether AKS supports them or whether you need to set up your own cluster like OpenAI did.)

I have a somewhat off-topic question as a complete TensorFlow beginner and it seems like you'd be in the know:

If I want to train a TF model distributed over many machines in GCP, it seems like I could use Cloud ML Engine or deploy Kubeflow to a K8s cluster running in GKE and train it there.

What should I consider when choosing between these two options? Is there another option I should consider?

RiseML provides a higher-level abstraction than Kubeflow that is more similar to Google Cloud ML. I would love to get your feedback on our solution: https://riseml.com

Btw: we are currently preparing an open-source release

Disclaimer: I am co-founder at RiseML

1. Well, deploying a k8s cluster is a huge engineering challenge.
2. Doing distributed training in general is an engineering challenge.
3. Combining the two gives an even larger engineering challenge: distributed training on k8s using GPUs.

ML Engine is an order of magnitude easier. You just have to do step 2 and set up your model to use multiple GPUs.
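
For a rough idea of what "step 2" involves on the code side, here is a minimal sketch, not a complete training setup. It assumes the platform describes the cluster to each process through the `TF_CONFIG` environment variable, a JSON document with `cluster` and `task` keys, which is the convention distributed TensorFlow uses. The helper name `read_tf_config` and the keys of the dictionary it returns are made up for illustration.

```python
import json
import os


def read_tf_config(environ=None):
    """Summarize this process's role from TF_CONFIG (hypothetical helper).

    Assumes TF_CONFIG looks like:
      {"cluster": {"worker": ["host:port", ...], "ps": [...]},
       "task": {"type": "worker", "index": 0}}
    """
    environ = os.environ if environ is None else environ
    # An unset TF_CONFIG is treated as plain single-machine training.
    config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return {
        "task_type": task.get("type"),            # e.g. "worker" or "ps"
        "task_index": int(task.get("index", 0)),  # position within that role
        "num_workers": len(cluster.get("worker", [])),
        "is_distributed": bool(cluster),
    }


if __name__ == "__main__":
    print(read_tf_config())
```

The training script would branch on this role (for example, only one worker writes checkpoints). Everything else, such as provisioning machines and wiring them together, is the part a managed service takes off your hands.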

Yep, we've been using GPUs for quite a while (even before the alpha support in Kube), both the K80s in Azure and some Pascals in our own clusters. With the support in Kube now it's quite seamless.
That's some groundbreaking work: GPUs in K8s. By Kube, do you mean the Kubeflow project?
Late reply, but Kube meant Kubernetes, not Kubeflow. Alpha GPU support landed in 1.6 if my memory serves me right. Before that you had to do a bunch of stuff manually to make GPUs work, mostly around scheduling. Since 1.6, Kubernetes automatically detects the GPUs on each node and can therefore assign workloads to nodes where they fit. Kubeflow is an abstraction layer on top of that which helps a lot when you want to do things such as distributed TensorFlow training. It also helps a bit for simpler jobs, for example by (almost) removing the need to manually mount the NVIDIA drivers from the host into the container.
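
To make the scheduling part concrete, here is a minimal sketch of how a workload asks for GPUs: the container declares a GPU count under its resource limits, and the scheduler only places the pod on a node that advertises enough of them. Assumptions: the resource name `nvidia.com/gpu` is the one exposed by NVIDIA's device plugin in later Kubernetes releases (the alpha-era support mentioned above used a different, alpha-prefixed name), and the pod name, image, and helper `gpu_pod_manifest` are hypothetical.

```python
import json


def gpu_pod_manifest(name, image, gpus, resource_name="nvidia.com/gpu"):
    """Build a minimal Pod manifest requesting `gpus` GPUs (illustrative only)."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # The GPU request goes under limits; the scheduler uses it
                    # to pick a node with enough free GPUs.
                    "resources": {"limits": {resource_name: gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical image name; 4 GPUs matches one NC24 VM with 4 K80s.
    manifest = gpu_pod_manifest("train", "example.invalid/trainer:latest", 4)
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and handed to `kubectl apply -f`, since manifests may be written in JSON as well as YAML.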