by minimaxir 2927 days ago

With the new discounts on preemptible GPUs (https://cloudplatform.googleblog.com/2018/06/Introducing-imp...), the economics of spinning up a fleet of GPUs with Kubernetes for a quick, parallelizable ML task become very interesting (assuming that Google allows enough GPU quota for such a fleet for non-enterprise users, anyway).

What I want to use Kubernetes + instant-GPU-fleet for is deep learning hyperparameter grid search (i.e. spin up a lot of preemptible GPUs and, for each parameter config, train the model on a single GPU in parallel, so the search speeds up roughly linearly with the number of GPUs).
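
A minimal sketch of that pattern, assuming each configuration can be trained independently. `train_one` is a hypothetical stand-in for a real single-GPU training job, the grid values are made up for illustration, and a local thread pool stands in for the fleet of GPU workers:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(grid):
    """Return one dict per point in the Cartesian product of the grid values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def train_one(config):
    """Hypothetical stand-in for training one model on one GPU."""
    # A real job would build and fit the model here; this returns a dummy score.
    return {"config": config, "score": config["lr"] * config["batch_size"]}


def run_grid(grid, num_workers):
    """Run every config, at most num_workers at a time (one worker per GPU)."""
    configs = expand_grid(grid)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in the same order as configs.
        return list(pool.map(train_one, configs))


# Illustrative values only: 3 learning rates x 2 batch sizes = 6 configs.
grid = {"lr": [0.1, 0.01, 0.001], "batch_size": [32, 64]}
results = run_grid(grid, num_workers=6)
print(len(results))  # 6
```

With as many workers as configurations, wall-clock time is roughly that of the slowest single run, which is where the linear speedup comes from. Preemption handling (retrying a config whose worker was reclaimed) is left out of this sketch.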

Kubeflow (https://github.com/kubeflow/kubeflow) is close to this functionality, but not quite there yet in user-friendliness. (You have to package everything in a huge Docker container and launch jobs from the CLI; ideally, what I want to do is spawn containers and start training directly from the JupyterHub notebook on the master node.)

3 comments

boulos 2927 days ago

Disclosure: I work on Google Cloud (and helped launch Preemptible VMs).

> (assuming that Google allows enough GPU quota for a fleet of GPUs for nonenterprise users anyways)

This is actually why we have separate preemptible quota [1], which we grant more freely. You can't stock out our full-price customers, so we're happy to let you spin up tons of V100s (and as of this morning TPUs!).

[1] https://cloud.google.com/compute/quotas#quotas_for_preemptib...

minimaxir 2927 days ago

I was wondering why there was a separate quota. That makes much more sense!

jamesblonde 2927 days ago

If you are adventurous enough to try a more user-friendly platform for distributed TensorFlow (Horovod, grid search using Spark/TensorFlow), have a look at http://www.hops.io (watch the video on the front page). In Hops, you don't program infrastructure (no YAML files, no Dockerfiles). You can run with 100s of GPUs for grid search using this Python code:

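  from hops import experiment  # assumed import path for the Hops helper library

  def train(lr, dropout):
    # tf code here (import any libraries here)
    pass

  # keys are assumed to match the argument names of train()
  args_dict = {'lr': [0.001, 0.0001, 0.00001], 'dropout': [0.4, 0.6, 0.8]}
  experiment.launch(spark, train, args_dict)  # spark: the notebook's Spark session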

The reason this works on clusters is that there is a persistent conda environment installed on all hosts in the cluster for every 'project' (think GitHub project). The above function 'train' is a PySpark mapper, and the Python libraries it needs are found in the local conda environment. You install conda libraries by search/click in a UI, not by writing a Dockerfile. You create your cluster by selecting how many GPUs and how much memory you need in a UI, not by writing a YAML file.

Disclosure: I work on this project.

w1nk 2927 days ago

Out of curiosity, why are you grid searching if you have access to Google's infrastructure (Vizier, etc.)?

minimaxir 2927 days ago

You mean Cloud ML Engine? (https://cloud.google.com/ml-engine/)

Cloud ML Engine is a bit opaque in terms of price efficiency (how powerful is a "training unit"?) but I suppose that's another option.

w1nk 2927 days ago

Hm, actually, sorry, I assumed they had made Vizier available as a black-box service (https://static.googleusercontent.com/media/research.google.c...). If they haven't, you can still save money/time by using something better than grid search. There are some reimplementations of Vizier available; I'll see if I can make ours available too (email in profile). I do agree that billing can be... complex ;)
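
As a rough illustration of "better than grid search", here is a sketch of plain random search. This is not what Vizier does (Vizier is a black-box optimization service with more sophisticated strategies); random search is just the simplest common alternative to a fixed grid. The search ranges and `toy_objective` are hypothetical stand-ins for a real validation loss:

```python
import math
import random


def sample_config(rng):
    """Draw one configuration: log-uniform learning rate, uniform dropout."""
    log_lr = rng.uniform(math.log10(1e-5), math.log10(1e-2))
    return {"lr": 10 ** log_lr, "dropout": rng.uniform(0.2, 0.8)}


def random_search(objective, num_trials, seed=0):
    """Evaluate num_trials random configs and return the one with the lowest objective."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(num_trials)]
    return min(trials, key=objective)


def toy_objective(config):
    """Hypothetical stand-in for validation loss; lowest near lr=1e-3, dropout=0.5."""
    return (math.log10(config["lr"]) + 3) ** 2 + (config["dropout"] - 0.5) ** 2


best = random_search(toy_objective, num_trials=50)
print(best)
```

Unlike a grid, every trial here tries a new value of every parameter, so for the same budget the search covers more distinct learning rates and dropout values. Each trial is still independent, so it parallelizes across GPUs the same way a grid does.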

minimaxir 2927 days ago

From the paper synopsis, it's implemented as a part of HyperTune. (https://cloud.google.com/ml-engine/docs/tensorflow/using-hyp...)
