by tromobne8vb 3654 days ago
I have read that. I even re-read it before making my post.

That implementation requires starting individual tasks on each node in your cluster.

>To create a cluster, you start one TensorFlow server per task in the cluster. Each task typically runs on a different machine, but you can run multiple tasks on the same machine (e.g. to control different GPU devices).
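To make the "one server per task" point concrete, here is a minimal sketch in plain Python. It does not call TensorFlow at all. The hostnames, ports, and helper names are made up for illustration; the only thing borrowed from TensorFlow is the shape of the cluster description (a dict mapping job names to lists of host:port addresses, which is the form `tf.train.ClusterSpec` accepts).

```python
# Hypothetical cluster layout: job name -> list of "host:port" addresses.
# Hostnames and ports are invented for illustration.
CLUSTER = {
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
}


def enumerate_tasks(cluster):
    """Return one (job_name, task_index, address) tuple per task.

    Each tuple corresponds to one server process that has to be started.
    """
    tasks = []
    for job_name in sorted(cluster):
        for task_index, address in enumerate(cluster[job_name]):
            tasks.append((job_name, task_index, address))
    return tasks


def hosts_to_provision(cluster):
    """Return the sorted set of machines that need the software installed."""
    hosts = {
        address.split(":")[0]
        for addresses in cluster.values()
        for address in addresses
    }
    return sorted(hosts)


if __name__ == "__main__":
    for job_name, task_index, address in enumerate_tasks(CLUSTER):
        print(f"start server: job={job_name} task={task_index} on {address}")
    print("machines to provision:", ", ".join(hosts_to_provision(CLUSTER)))
```

Every tuple is a process someone (or some scheduler) has to launch, and every host in the second list needs the libraries installed, which is the manual rollout step being complained about here.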

I'm used to tools that can roll out to a cluster with more finesse than that. The Spark wrapper seems to provide some capability to do this automatically, but even the Spark wrapper requires installing Python libraries on each node.

3 comments

Yeah, I'm trying to figure this out too. TensorFlow needs YARN support. Ideally, YARN would allocate resources and tell each process about the other workers in the graph. I see that as the harder part. If you use Mesos, there is some preliminary support for that: https://github.com/tensorflow/tensorflow/issues/1996

Since TensorFlow has native dependencies on CUDA for GPU support, I don't think there's much of a way around installing things on every machine. You might be able to use conda to package a Python environment without CUDA for Spark to run. Here's an interesting blog post about that: https://www.continuum.io/blog/developer-blog/conda-spark

But I'm not sure I see the point in running TensorFlow without GPU support. And if you're hoping to run GPU machines in an existing Spark cluster and intelligently schedule the GPU work onto the right machines... that's gonna be tough. Here's an interesting talk on that from the last Spark Summit: https://www.youtube.com/watch?v=k6IOWblLQK8&feature=youtu.be

Ultimately, you're probably better off just running your own GPU cluster strictly for your TensorFlow model on ephemeral AWS spot instances.

Or just use Google Cloud Machine Learning. That's what Google wants and expects you to do anyway. Borg is the Borg. You will be assimilated.

TensorFlow without GPU support is very useful for inference.
I think what is going on here is that what we see as complications are actually features, but that doesn't become clear until you are operating with your NNs in production, at scale.

What you want to be able to do is control which devices (CPUs, GPUs or co-processors[1]) execute which part of your model (e.g., GPU for training, co-processors for inference, who knows what else).
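A rough sketch of what that kind of control could look like, in plain Python. Everything here is hypothetical: the phase names, the device labels, and the `pick_device` helper are illustrations of the idea, not TensorFlow's API or its device naming.

```python
# Hypothetical preference table: for each phase of the model's life,
# the device kinds we would like to use, in priority order.
PREFERENCES = {
    "training": ["gpu", "cpu"],
    "inference": ["coprocessor", "cpu"],
}


def pick_device(phase, available_kinds):
    """Return a label for the first preferred device kind that is available.

    `available_kinds` is the set of device kinds present on this machine.
    Raises ValueError if no preferred kind for the phase is present.
    """
    for kind in PREFERENCES[phase]:
        if kind in available_kinds:
            return f"{kind}:0"
    raise ValueError(f"no suitable device for phase {phase!r}")


if __name__ == "__main__":
    machine = {"cpu", "gpu"}  # hypothetical machine with no co-processor
    for phase in ("training", "inference"):
        print(phase, "->", pick_device(phase, machine))
```

The explicit table is what makes per-node setup look like a feature rather than a complication: the same model can land on different hardware depending on what each machine actually has.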

Yahoo released some code to deal with similar issues, but with Caffe on YARN[2].

[1] https://cloudplatform.googleblog.com/2016/05/Google-supercha...

[2] http://yahoohadoop.tumblr.com/post/129872361846/large-scale-...