Hacker News
by minimaxir 2927 days ago
With the new discounts on preemptible GPUs (https://cloudplatform.googleblog.com/2018/06/Introducing-imp...), the economics of quickly spinning up a fleet of GPUs with Kubernetes for a quick parallelizable ML task become very interesting. (assuming that Google allows enough GPU quota for a fleet of GPUs for nonenterprise users anyways)

What I want to use Kubernetes + an instant GPU fleet for is deep learning hyperparameter grid search (i.e. spin up a lot of preemptible GPUs and, for each parameter config, train the model on its own GPU in parallel, so that wall-clock time for scanning the grid scales roughly linearly with the number of GPUs).
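To make the "one config per GPU" idea concrete, here is a minimal local sketch. It is not Kubernetes or Kubeflow code: `expand_grid`, `train_one`, and `run_grid` are hypothetical names, the score is a toy formula standing in for real training, and threads stand in for GPU workers.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


def expand_grid(grid):
    """Return one config dict per combination of the listed values."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in itertools.product(*(grid[n] for n in names))]


def train_one(job):
    """Hypothetical stand-in for training one model on one GPU."""
    gpu_id, config = job
    # A real job would pin itself to gpu_id and return a validation metric.
    # This toy score just prefers lr near 0.001 and dropout near 0.5.
    score = -abs(config["lr"] - 0.001) - abs(config["dropout"] - 0.5)
    return gpu_id, config, score


def run_grid(grid):
    """Run every config at the same time, one (simulated) GPU per config."""
    configs = expand_grid(grid)
    jobs = list(enumerate(configs))  # job i runs on simulated GPU i
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        return list(pool.map(train_one, jobs))


grid = {"lr": [0.001, 0.0001, 0.00001], "dropout": [0.4, 0.6, 0.8]}
results = run_grid(grid)
best = max(results, key=lambda r: r[2])
print(len(results), "configs trained; best:", best[1])
```

A 3 x 3 grid gives 9 independent jobs, so with 9 GPUs the whole scan takes about as long as the slowest single training run.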

Kubeflow (https://github.com/kubeflow/kubeflow) is close to this functionality, but not quite there yet in user-friendliness. (You have to package everything in a huge Docker container and launch jobs from the CLI; ideally what I want to do is spawn containers and start training directly from the JupyterHub notebook on the master node.)

3 comments

Disclosure: I work on Google Cloud (and helped launch Preemptible VMs).

> (assuming that Google allows enough GPU quota for a fleet of GPUs for nonenterprise users anyways)

This is actually why we have separate preemptible quota [1], which we grant more freely. You can't stock out our full-price customers, so we're happy to let you spin up tons of V100s (and as of this morning TPUs!).

[1] https://cloud.google.com/compute/quotas#quotas_for_preemptib...

I was wondering why there was a separate quota. That makes much more sense!
If you are adventurous enough to try a more user-friendly platform for Distributed TensorFlow (Horovod, grid search using Spark/TensorFlow), have a look at http://www.hops.io (watch the video on the front page). In Hops, you don't program infrastructure (no YAML files, no Dockerfiles). You can run with 100s of GPUs for grid search using this Python code:

  def train(lr, dropout):
    # tf code here (import any libraries here)
    pass

  # one list of candidate values per argument of train
  args_dict = {'lr': [0.001, 0.0001, 0.00001], 'dropout': [0.4, 0.6, 0.8]}
  experiment.launch(spark, train, args_dict)
The reason this works on clusters is that there is a persistent conda environment installed on all hosts in the cluster for every 'project' (think GitHub project). The function 'train' above is a PySpark mapper, and the Python libraries it needs are found in the local conda environment. You install conda libraries by search/click in a UI, not by writing a Dockerfile. You create your cluster by selecting how many GPUs and how much memory you need in a UI, not by writing a YAML file.

Disclosure: I work on this project.

Out of curiosity, why are you grid searching if you have access to Google's infrastructure (Vizier, etc.)?
You mean Cloud ML Engine? (https://cloud.google.com/ml-engine/)

Cloud ML Engine is a bit opaque in terms of price efficiency (how powerful is a "training unit"?) but I suppose that's another option.

Hm, actually, sorry: I assumed they had made Vizier available as a black-box service (https://static.googleusercontent.com/media/research.google.c...). If they haven't, you can save money and time by using something better than grid search. There are some reimplementations of Vizier available; I'll see if I can make ours available too (email in profile). I do agree that billing can be... complex ;)
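As one simple example of "better than grid search", here is a sketch of random search over the same two hyperparameters. To be clear, this is a swapped-in baseline, not Vizier's algorithm and not any Google API; `sample_config`, `random_search`, and `toy_objective` are hypothetical names, and the objective is a toy stand-in for a real validation score.

```python
import math
import random


def sample_config(rng):
    """Draw one config: log-uniform learning rate, uniform dropout."""
    lr = 10 ** rng.uniform(-5, -3)  # roughly between 1e-5 and 1e-3
    dropout = rng.uniform(0.4, 0.8)
    return {"lr": lr, "dropout": dropout}


def random_search(objective, num_trials, seed=0):
    """Sample num_trials configs and return the one with the highest score."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(num_trials)]
    return max(trials, key=objective)


def toy_objective(config):
    # Hypothetical stand-in for a validation score; a real objective
    # would train a model with this config and evaluate it.
    return -abs(math.log10(config["lr"]) + 3.5) - abs(config["dropout"] - 0.5)


best = random_search(toy_objective, num_trials=9)
print(best)
```

With the same budget of 9 trials, a 3 x 3 grid only ever tries 3 distinct learning rates, while the random draws try 9, which helps when one hyperparameter matters much more than the other. Each trial is still independent, so it parallelizes across preemptible GPUs the same way a grid does.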
From the paper synopsis, it's implemented as a part of HyperTune. (https://cloud.google.com/ml-engine/docs/tensorflow/using-hyp...)