Hacker News
by thesandlord 3072 days ago
According to the article, they are using NC24 VMs, which have 4 K80s attached. So yes, I would assume they are using GPUs.

Check out https://github.com/google/kubeflow if you are interested in doing the same.

(Disclaimer: I work for GCP doing K8s stuff. I know GKE clusters support GPUs and Kubeflow, but I'm not 100% sure whether AKS supports them or whether you need to set up your own cluster like OpenAI did.)


I have a somewhat off-topic question as a complete TensorFlow beginner and it seems like you'd be in the know:

If I want to train a TF model distributed over many machines in GCP, it seems like I could use Cloud ML Engine or deploy Kubeflow to a K8s cluster running in GKE and train it there.

What should I consider when choosing between these two options? Is there another option I should consider?

RiseML provides a higher-level abstraction than Kubeflow that is more similar to Google Cloud ML. I would love to get your feedback on our solution: https://riseml.com

Btw: we are currently preparing an open-source release

Disclaimer: I am co-founder at RiseML

1. Well, deploying a k8s cluster is a huge engineering challenge. 2. Doing distributed training in general is also an engineering challenge. 3. Combining the two gives you an even larger engineering challenge: doing distributed training on k8s using GPUs.

ML Engine is an order of magnitude easier. You just have to do step 2 and set up your model to employ multiple GPUs.
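
To give a rough idea of what step 2 involves: TensorFlow's distributed training reads a TF_CONFIG environment variable that describes the cluster and the role of the current process. As far as I know, managed services such as ML Engine set this for you, while on your own cluster something has to generate it for every pod. Below is a minimal sketch of building that value by hand. The hostnames, ports, and the build_tf_config helper are made up for illustration and are not part of any library.

```python
import json
import os


def build_tf_config(worker_hosts, ps_hosts, task_type, task_index):
    """Return a TF_CONFIG JSON string for one process in a training cluster.

    build_tf_config is a hypothetical helper, not a TensorFlow API.
    """
    cluster = {"worker": list(worker_hosts)}
    if ps_hosts:
        # Parameter servers are optional; only list them when present.
        cluster["ps"] = list(ps_hosts)

    if task_type not in cluster:
        raise ValueError("unknown task type: %r" % task_type)
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index %d out of range" % task_index)

    config = {
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    }
    return json.dumps(config, sort_keys=True)


# Assumed example addresses; replace with the real host:port of each process.
workers = ["worker-0.example:2222", "worker-1.example:2222"]
parameter_servers = ["ps-0.example:2222"]

# Each process gets the same cluster description but its own task entry.
os.environ["TF_CONFIG"] = build_tf_config(workers, parameter_servers, "worker", 1)
```

Every machine in the job needs the same cluster section with a different task section, and that bookkeeping is a large part of what a managed service or an operator on k8s takes off your hands.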