| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by bitL 2918 days ago
	Try to run multiple models/ensemble training on many computers with many GPUs to pick up the best performing model or combo. TensorFlow so far has probably the easiest approach for it. That might be reason for the attitude "real deep learning engineers use Tensorflow", as other approaches either don't scale that well or you can't even model something you need for your bleeding-edge billion $-making approach, despite other frameworks being much much simpler/more natural and a joy to use.

2 comments

mlthoughts2018 2918 days ago

Most of the tools that TensorFlow offers for multi-gpu and distributed model training will "just work" directly with Keras models too, or with really minor tweaks. You can even easily mix and match pure TensorFlow code (like explicitly setting the device with a device placement context manager) with Keras code.

See e.g. [0] and [1] linked below.

For model ensembling, it's even easier. After training, in Keras you could simply load your multiple models and create a new Model() object that does nothing but use a merge layer (with mode set to averaging) to average across multiple input models, even if the models share layers or have other crazy constraints. Writing that final ensemble is extremely easy in Keras.

In my experience researching and productionizing very deep Keras models for an image processing use case that has moderately tight performance constraints, Keras has proved to scale extremely well and the code remains dead simple the whole time.

[0]: < https://blog.keras.io/keras-as-a-simplified-interface-to-ten... >

[1]: < https://www.tensorflow.org/programmers_guide/estimators#crea... >

link

bitL 2918 days ago

Thanks for the links! Do you know how to convert a generator from Keras to an input in estimator, add class weights, custom loss functions, plug-in various Keras-based callbacks as well? I couldn't find any guide for that part.

What do you use to orchestrate distributed training in Keras?

link

mlthoughts2018 2918 days ago

> "how to convert a generator from Keras to an input in estimator"

This is a bit of a mistaken question, because you would not "convert" a DataGenerator into an estimator input. Instead, you can just wrap the DataGenerator in a simple function that lazily outputs the next batch of training examples. Input functions for Estimators are just functions that accept no arguments and produce a 2-tuple, with first component of a dictionary of named inputs and second component of the target value. You can write your own wrapper functions that consumes from a DataGenerator and normalized the output to the format. I'm sure there will be a helper function to do this automatically in the future, but it's about as easy as can be to just wrap with a function anyway.

> "add class weights, custom loss functions"

This too seems mistaken, because this is part of the compiled Keras model, before ever converting anything to TensorFlow Estimator. You can use whatever you want for this and the Keras Model.compile function accepts dictionaries for loss and loss_weights, as well as custom add_loss usage in your own layers (even pass through layers that don't affect the computation graph).

> "plug-in various Keras-based callbacks as well"

This is admittedly slightly harder, but I think it's a little bit of an unfair question because Keras offers far more functionality in its Callbacks than TensorFlow offers with predefined hooks. "Penalizing" Keras because TensorFlow offers less functionality doesn't seem right.

Either way, this is also not too hard. For any Callback you want to use from Keras, you basically just write a tiny wrapper class that subclasses from session_run_hook.SessionRunHook from tensorflow, and then maps the TensorFlow naming conventions, like "begin" or "before_run" etc., to wrap the equivalent method from the Keras callback, like "on_train_begin", or "on_epoch_end".

The bigger point is that this headache is because of TensorFlow. Both because TF chose a really silly class design for the SessionRunHooks thing, making automatic conversion from Keras (which has the more established set of pre-existing callbacks) harder for no good reason, and also because TensorFlow lacks functionality that Keras gives you for free.

For orchestration, my team just uses a simple GPU cluster where the native device placement primitives with TensorFlow allow us to scale to as many GPUs as we've needed (max in the dozens).

For distributing and orchestrating over larger clusters, Keras provides some good alternatives right on its own FAQ page:

< https://keras.io/why-use-keras/#keras-has-strong-multi-gpu-s... >

In the end, I would not claim you can immediately translate every complex feature of Keras, like deep custom callbacks or something, over to TensorFlow ... but that's usually not a big deal. Most times, you just want to port a fairly standard model to the Estimator API, and for this, it "just works" directly and is easy to use for local, small-ish clusters of GPUs.

When you have a much rarer problem that needs a huge GPU cluster, then use the other suggests like dist-keras or Horovod, or write your own simple map-reduce-ish wrapper to put data on different nodes and deploy e.g. a containerized training application.

Also people need to definitely keep in mind that most of the limitations are TensorFlow's own fault for not designing things to be compatible with heavily used Keras features like Callbacks out of the box. TensorFlow has a history of doing this, and has been very developer-unfriendly in this way even when it has no downside or impact on performance or anything. The core TensorFlow designs suffer from an unfortunate "not invented here" kind of philosophy, even when dealing with Keras.

link

bitL 2917 days ago

> Instead, you can just wrap the DataGenerator in a simple function that lazily outputs the next batch of training examples.

You probably know those simple generators aren't recommended to be used by Keras, instead keras.utils.Sequence is preferred due to (Keras doc): "Sequence are a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch which is not the case with generators."

I couldn't see any equivalent of this for estimators, sadly, and wrapping it up in a naive generator seemed like a functionality downgrade.

> because this is part of the compiled Keras model, before ever converting anything to TensorFlow Estimator

Right, you specify a loss function before compiling, however if it is a custom one and you for some reason need to reload a model snapshot (i.e. resuming training), you need to provide it separately or the loading fails. I haven't found any docs on this. Imagine your training optimizer automatically generating loss functions by means of function composition, e.g. you put a mix of +-*/,log,exp,tanh etc. based off some past training experience of what helped in individual cases/literature, then taking 1000s of these loss functions and pushing them to a large cluster where they are scored on how well did they perform, keeping only the best performing ones.

Class weights are specified in fit_generator(), not in compile time; again, here I couldn't find any description on how to convert Keras' weight dictionary to what TensorFlow needs.

> Callbacks... "Penalizing" Keras because TensorFlow offers less functionality doesn't seem right.

The thing here is that some of those callbacks are mandatory for a training to converge, e.g. decreasing learning rate, escaping plateau situations, computing various stats that aren't provided by Keras (outside loss/accuracy; you might want F1, Fleiss/Cohen's Kappa, Matthews correlation coefficient, AUC ROC etc.) that might be decisive for keeping/discarding a model; then also multi GPU callbacks; some people even use callbacks to perform the whole distributed computation as well. In my examples, if I remove any of those callbacks, my models won't achieve any kind of usability but with those callbacks I match world-class results. I couldn't find any non-insane way to map them to TensorFlow prior to our conversation.

As I mentioned, I have a very large cluster, each node with multiple GPUs, so I need an orchestration on both hyperparameters/loss functions per node as well as within each node to run on multiple GPUs.

The page from Keras you mentioned was precisely my starting point and from those tf.Estimator seemed the last devops-intense way to go (Horovod needs MPI and CERNDB/Keras Spark).

I'll take a deeper look into SessionRunHook you mentioned - thanks! ;-)

link

mlthoughts2018 2917 days ago

For class weights, the easiest thing is to just generate that as another one of the items placed into the input function dictionary, e.g. when you wrap the DataGenerator. Then have a custom loss function that takes this input element and applies the weight for that training sample. Again, the need to do slight extra work is a limitation of TensorFlow here, not of Keras, but because Keras is so flexible, it's super easy to work around.

> "computing various stats that aren't provided by Keras.."

It seems like you have this backward. Keras provides the easy interface to create the custom callbacks. That's why you can create extra convergence metrics, etc., that are far harder to use if implementing in pure TensorFlow. The part where TensorFlow is specifically lacking functionality is in its ability to handle these callbacks (both pre-built in Keras or user-defined). I've had good success with the solution I mentioned with SessionRunHooks, but still, it is a terrible design choice by the TensorFlow people to create this in a way that is not directly compatible with all the work Keras had done.

> "from those tf.Estimator seemed the last devops-intense way to go (Horovod needs MPI and CERNDB/Keras Spark)."

Just based on how poorly designed the tf.Estimator API is though, I'm not actually sure the other methods would require less devops or less investment. In some cases for standard models, yes. But if you've already committed to using Keras for very customized situations, then going back to the dark ages with native TensorFlow will often be much more work and more error prone than using the other solutions. The Horovod dependence on MPI in particular is fairly simple and needs little management. Most people having done ML / stats PhDs will already have managed far more difficult situations with MPI previously anyway, or at least have the Linux skills needed. The point is you have a fighting chance, whereas deciphering undocumented and badly designed corners of TensorFlow often leaves you with no fighting chance.

link

wodenokoto 2918 days ago

Funny, I dumped tensor flow a few years ago because it wasn't possible to do bleeding edge stuff.

link

bitL 2918 days ago

Yeah, for many bleeding edge things you still need PyTorch ;-)

link

riku_iki 2918 days ago

Maybe you have a chance to provide example?

link