by frakkingcylons 3072 days ago
I have a somewhat off-topic question as a complete TensorFlow beginner and it seems like you'd be in the know:

If I want to train a TF model distributed over many machines in GCP, it seems like I could use Cloud ML Engine or deploy Kubeflow to a K8s cluster running in GKE and train it there.

What should I consider when choosing between these two options? Is there another option I should consider?

2 comments

RiseML provides a higher-level abstraction than Kubeflow that is more similar to Google Cloud ML. I would love to get your feedback on our solution: https://riseml.com

Btw: we are currently preparing an open-source release.

Disclaimer: I am co-founder at RiseML

1. Deploying a k8s cluster is a huge engineering challenge. 2. Doing distributed training in general is an engineering challenge. 3. Combine the two and you get an even larger engineering challenge: distributed training on k8s using GPUs.

ML Engine is an order of magnitude easier. You just have to do step 2 and set up your model to use multiple GPUs.
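
To make step 2 a bit more concrete: part of the work is telling each process where it sits in the cluster. TensorFlow reads that from a TF_CONFIG environment variable, a JSON object with a cluster map and the current task's type and index. As far as I know, ML Engine and Kubeflow's TFJob operator both set it for you, while on a hand-rolled setup you generate it yourself. A minimal sketch of that generation step, stdlib only; the host names and the helper names (make_tf_config, all_task_configs) are made up for illustration:

```python
import json


def make_tf_config(workers, ps, task_type, task_index):
    """Return the TF_CONFIG JSON string for one task in the cluster."""
    cluster = {"worker": list(workers)}
    if ps:
        # Parameter servers are optional; leave the key out if there are none.
        cluster["ps"] = list(ps)
    if task_type not in cluster:
        raise ValueError("unknown task type: %r" % (task_type,))
    if not 0 <= task_index < len(cluster[task_type]):
        raise ValueError("task index %d out of range" % task_index)
    config = {
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    }
    return json.dumps(config, sort_keys=True)


def all_task_configs(workers, ps):
    """Map a task name such as 'worker-0' to its TF_CONFIG string."""
    configs = {}
    for job, hosts in (("worker", workers), ("ps", ps)):
        for index in range(len(hosts)):
            name = "%s-%d" % (job, index)
            configs[name] = make_tf_config(workers, ps, job, index)
    return configs


# Hypothetical addresses; substitute whatever your machines are called.
WORKERS = ["worker-0.example.internal:2222", "worker-1.example.internal:2222"]
PS = ["ps-0.example.internal:2222"]

if __name__ == "__main__":
    for name, tf_config in sorted(all_task_configs(WORKERS, PS).items()):
        # Each process would be started with its own value exported as TF_CONFIG.
        print(name, tf_config)
```

Every process gets the same cluster map and a different task entry. The managed options take this bookkeeping off your hands, which is a good part of why they feel easier.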