|
|
|
|
|
by jamesblonde
2929 days ago
|
|
If you are adventurous enough to try a more user-friendly platform for Distributed TensorFlow (Horovod, grid-search using Spark/TensorFlow), have a look at http://www.hops.io (watch video on front page).
In Hops, you don't program infrastructure (no YAML file, no Dockerfiles). You can run with 100s of GPUs for grid search using this python code: def train(param1, param2, ..):
# tf code here (import any libraries here)
dict = { 'lr': [0.001, 0.0001, 0.00001], 'dropout': [0.4, 0.6, 0.8]}
experiment.launch(spark, train, dict)
The reason this works on clusters is that there is a persistent conda environment installed on all hosts in the cluster for every 'project' (think Github project). The above function 'train' is a pyspark mapper and the python libraries needed in it are found in the local conda environment. You install conda libraries by search/click in a UI, not by writing a Dockerfile. You create your cluster by selecting how many GPUs/memory you need in a UI, not by writing a YML file.Diclosure: i work on this project. |
|