by formalsystem 2918 days ago
This article echoes my experience as well. I was working on some core NLP models for a larger tech company and wanted to experiment with Keras. I had my models designed within a day and training done within another and had amazing model perf.

I was also told that doing it the real way using Tensorflow would be the way to go and I agree with that sentiment if my problem was Google scale which it wasn't. In fact I would argue that most workloads around the world are not Google scale and neither are most Google workloads.

This attitude of "real deep learning engineers use Tensorflow" is an unhelpful way of saying "I agree that the API is unreadable but I've invested so much time in the ecosystem that I'll refuse to see its usability problems". Kind of reminds me of assembly programmers that thought C wasn't for l33t 10xx pwner programmers.

9 comments

> Kind of reminds me of assembly programmers that thought C wasn't for l33t 10xx pwner programmers.

The problem with TensorFlow is mainly that you, as a user, have to build a data-dependency graph. This is something a C compiler can do very well, but Python is not so suitable for that.

So, in my view, TensorFlow chose the wrong substrate for their "more efficient" library. Instead, they should have developed their own language, where the whole data-flow graph determination could be implicit, and not a concern for the programmer.

However, computing a data-flow graph as-you-go (by the library, not the user), as (I think) some libraries do, is quite a good approach, since the overhead is quite small (percentage-wise) compared to the large tensor operations that can be performed in highly optimized code.
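
To make "as-you-go" concrete, here is a minimal sketch in plain Python, not taken from any real library. The `Var` class and `backward` function are hypothetical names; the point is only that the graph gets recorded by the library while ordinary code runs, and can then be walked backwards for gradients.

```python
# Hypothetical minimal "define-by-run" tracer: the library, not the user,
# records the data-flow graph while ordinary Python code executes.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate gradients through the recorded graph in reverse topological order."""
    order, visited = [], set()

    def visit(node):
        if id(node) in visited:
            return
        visited.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad


x = Var(3.0)
y = Var(4.0)
z = x * y + x   # the graph is built implicitly by running this line
backward(z)     # x.grad is now dz/dx = y + 1, y.grad is dz/dy = x
```

The bookkeeping per operation is a couple of tuple allocations, which is the "small overhead" argument: it is negligible next to a large matrix multiply.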

> So, in my view, TensorFlow chose the wrong substrate for their "more efficient" library. Instead, they should have developed their own language, where the whole data-flow graph determination could be implicit, and not a concern for the programmer.

You just described Swift for TensorFlow.

Well, apparently :) They say: We believe that machine learning tools are so important that they deserve a first-class language and a compiler.

However, I'd like to see some numbers on how much more efficient it is to build a graph in advance, given that the lion's share of the computations will be in tensor math anyway (which can be heavily optimized, and is independent of the graph).

The problem is that if you build the graph as-you-go, dataflow graph optimizations cannot be done efficiently (some high level optimizations such as data layout optimization, automatic data / model parallelism etc.). Swift can do all these because the compiler can extract the graph out ahead of time.
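
As a toy illustration of why having the whole graph up front helps, here is a sketch of a rewrite pass over a static expression graph. It uses constant folding, which is a much simpler optimization than the layout or parallelism passes mentioned above, and the tuple-based graph format is made up for this example.

```python
# Hypothetical static graph format, nested tuples:
#   ("const", value), ("input", name), ("add", a, b), ("mul", a, b)
def fold_constants(node):
    """Return an equivalent graph with constant-only subgraphs precomputed."""
    kind = node[0]
    if kind in ("const", "input"):
        return node
    left = fold_constants(node[1])
    right = fold_constants(node[2])
    if left[0] == "const" and right[0] == "const":
        value = left[1] + right[1] if kind == "add" else left[1] * right[1]
        return ("const", value)
    return (kind, left, right)


# x * (2 + 3): the (2 + 3) part does not depend on any input
graph = ("mul", ("input", "x"), ("add", ("const", 2.0), ("const", 3.0)))
optimized = fold_constants(graph)
```

The pass can only run because the full graph exists before any data flows through it; a tracer that sees one operation at a time would have to execute `2 + 3` on every call.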

> You just described Swift for TensorFlow

This one? https://www.tensorflow.org/api_docs/swift/

As others are pointing out, TF isn't that hard.

Or, rather, it is hard, but the difficulty comes from getting an intuition for which part of this weird multi-layer net is producing this weird behavior: is it an artefact or something interesting, is the connectivity complete, and should I change the learning rate and activation functions?

The real reason to use Tensorflow is the same reason you might use a Go framework instead of Rails: in your heart you have this hope that this thing will one day grow into a really large project and support lots of people and that will be easier with this scalable, optimized code.

It's not even that you'll hit Google scale, it's that you'll hit popular scale and still serve the whole thing out of your Digital Ocean droplet.

"It's not even that you'll hit Google scale, it's that you'll hit popular scale and still serve the whole thing out of your Digital Ocean droplet."

Are you saying that model inference is slower or less efficient for a model built and trained in Keras, than the same model architecture built directly in tensorflow?

Actually, with Tensorflow as a Keras backend, I would expect them to be the same. I am not sure where the performance difference between TF and TF as a backend comes from.

I do think that pure TF would be easier to scale up over multiple servers etc. but that's only because I don't know how it would work in Keras. Maybe it's easy.

It's pretty straightforward to convert a Keras model to a TF graph, as long as you used a TF backend in Keras.
I would think the difference would be from the data input pipeline, efficiency in batching, updating online models. The inference itself would be the exact same.

So I'm automating my job on a shitty tiny laptop.

Do you think I'll be able to use DL?

> I was also told that doing it the real way using Tensorflow would be the way to go and I agree with that sentiment if my problem was Google scale which it wasn't.

Use the right tool for the job. Keras can get you to a working model faster. However, I am not sure what the current situation is, but in the past it was not possible to dump and freeze Keras' Tensorflow graphs. This can be a problem if you want to embed a model in a non-Python application.

> This attitude of "real deep learning engineers use Tensorflow"

Real engineers use whatever they need to use. But I think that you are overstating the difficulty of Tensorflow. Over the last 6 months, we have hired a couple of students for a research project. Since we standardized on Tensorflow, they had to implement new models in Tensorflow. All of them were up to speed in Tensorflow pretty quickly (they mostly do RNNs and seq2seq learning).

> dump and freeze Keras' Tensorflow graphs

You can get a direct reference to the graphs if you want, that will let you do anything tensorflow lets you do. I think this is what you want:

  # This assumes TF 1.x with the TensorFlow backend, and that `model` is
  # your Keras model, ready to be called with .predict()
  import tensorflow as tf
  from keras import backend as K

  sess = K.get_session()
  graph = sess.graph
  graph_def = graph.as_graph_def()

  # Names of the output ops to keep (here taken from the model's outputs)
  nodes_to_output = [out.op.name for out in model.outputs]

  frozen_graph = tf.graph_util.convert_variables_to_constants(
      sess, graph_def, nodes_to_output)

  encoded_frozen_graph = frozen_graph.SerializeToString()

That didn't work before, but admittedly, the last time I tried was probably 1.5 years ago.

Even better... use Keras' MXNet backend. Training is ~30% faster, you get multi-GPU for free, and you can perform inference in MXNet easily.

Not to mention you can more easily use channels-first data, quantize to FP16/INT8 more easily, and export to ONNX for use w/ TensorRT and/or Intel Nervana.

> in the past it was not possible to dump and freeze Keras' Tensorflow graphs.

This was never true.

There was no obvious Keras API for this, but you could build a model with the Keras API, then use the TF API to save it. The inference API would be the TF API (i.e. you'd need to find the names of all your input and output tensors and use those with Session.run).

> This was never true.

Except that this was true. I do not remember the exact details, because this was the end of 2015 or beginning of 2016. But dumping/freezing definitely failed on some graphs constructed with Keras.

> There was no obvious Keras API for this, but you could build a model with the Keras API, then use the TF API to save it.

That was easy to figure out. Read the backend implementation and you can see how you can get the graph definition, etc.

> Kind of reminds me of assembly programmers that thought C wasn't for l33t 10xx pwner programmers.

It's funny because this is the same attitude C/C++ programmers have towards developers using other languages now...

Pffft what are programming languages??? If you aren't writing code in straight up binary then you aren't a real h@k3r

I think there is an emacs command to convert between that and butterfly wing flaps directing cosmic rays to flip bits.

(h/t xkcd)

It’s a pretty universal thing...

Try to run multiple models/ensemble training on many computers with many GPUs to pick the best performing model or combo. TensorFlow so far has probably the easiest approach for it. That might be the reason for the attitude "real deep learning engineers use Tensorflow", as other approaches either don't scale that well or you can't even model something you need for your bleeding-edge billion $-making approach, despite other frameworks being much much simpler/more natural and a joy to use.

Most of the tools that TensorFlow offers for multi-gpu and distributed model training will "just work" directly with Keras models too, or with really minor tweaks. You can even easily mix and match pure TensorFlow code (like explicitly setting the device with a device placement context manager) with Keras code.

See e.g. [0] and [1] linked below.

For model ensembling, it's even easier. After training, in Keras you could simply load your multiple models and create a new Model() object that does nothing but use a merge layer (with mode set to averaging) to average across multiple input models, even if the models share layers or have other crazy constraints. Writing that final ensemble is extremely easy in Keras.
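
The averaging idea itself is tiny. Here is a framework-free sketch, where each "model" is just a hypothetical callable returning class probabilities; in Keras you would express the same thing with layers rather than a Python function.

```python
def make_ensemble(models):
    """Return a predict function that averages the outputs of all models."""
    def predict(x):
        outputs = [model(x) for model in models]
        count = len(outputs)
        # zip(*outputs) groups the i-th probability from every model together
        return [sum(values) / count for values in zip(*outputs)]
    return predict


# Hypothetical stand-ins for three trained models with two output classes
model_a = lambda x: [0.75, 0.25]
model_b = lambda x: [0.50, 0.50]
model_c = lambda x: [0.25, 0.75]

ensemble = make_ensemble([model_a, model_b, model_c])
averaged = ensemble(None)  # the input is ignored by these dummy models
```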

In my experience researching and productionizing very deep Keras models for an image processing use case that has moderately tight performance constraints, Keras has proved to scale extremely well and the code remains dead simple the whole time.

[0]: < https://blog.keras.io/keras-as-a-simplified-interface-to-ten... >

[1]: < https://www.tensorflow.org/programmers_guide/estimators#crea... >

Thanks for the links! Do you know how to convert a generator from Keras to an input in estimator, add class weights, custom loss functions, plug-in various Keras-based callbacks as well? I couldn't find any guide for that part.

What do you use to orchestrate distributed training in Keras?

> "how to convert a generator from Keras to an input in estimator"

This is a bit of a mistaken question, because you would not "convert" a DataGenerator into an estimator input. Instead, you can just wrap the DataGenerator in a simple function that lazily outputs the next batch of training examples. Input functions for Estimators are just functions that accept no arguments and produce a 2-tuple, with first component of a dictionary of named inputs and second component of the target value. You can write your own wrapper function that consumes from a DataGenerator and normalizes the output to that format. I'm sure there will be a helper function to do this automatically in the future, but it's about as easy as can be to just wrap with a function anyway.
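
A rough sketch of that wrapper, in plain Python so it runs without TensorFlow. `batch_generator`, `make_input_fn` and the `"features"` key are made-up names, and a real Estimator input function would return tensors rather than lists; this only shows the shape of the contract (no arguments in, a `(dict, targets)` 2-tuple out).

```python
def batch_generator():
    """Hypothetical stand-in for a Keras data generator: yields (inputs, targets) forever."""
    start = 0
    while True:
        inputs = [start, start + 1]
        targets = [value % 2 for value in inputs]
        yield inputs, targets
        start += 2


def make_input_fn(generator, input_name="features"):
    """Wrap a generator in a zero-argument function returning ({name: inputs}, targets)."""
    def input_fn():
        inputs, targets = next(generator)  # lazily pull the next batch
        return {input_name: inputs}, targets
    return input_fn


input_fn = make_input_fn(batch_generator())
first = input_fn()
second = input_fn()
```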

> "add class weights, custom loss functions"

This too seems mistaken, because this is part of the compiled Keras model, before ever converting anything to TensorFlow Estimator. You can use whatever you want for this and the Keras Model.compile function accepts dictionaries for loss and loss_weights, as well as custom add_loss usage in your own layers (even pass through layers that don't affect the computation graph).

> "plug-in various Keras-based callbacks as well"

This is admittedly slightly harder, but I think it's a little bit of an unfair question because Keras offers far more functionality in its Callbacks than TensorFlow offers with predefined hooks. "Penalizing" Keras because TensorFlow offers less functionality doesn't seem right.

Either way, this is also not too hard. For any Callback you want to use from Keras, you basically just write a tiny wrapper class that subclasses from session_run_hook.SessionRunHook from tensorflow, and then maps the TensorFlow naming conventions, like "begin" or "before_run" etc., to wrap the equivalent method from the Keras callback, like "on_train_begin", or "on_epoch_end".
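
Roughly what I mean, as a sketch that runs without TensorFlow or Keras installed. `RecordingCallback` is a hypothetical stand-in that only borrows the Keras callback method names, and `CallbackHook` is a plain class here; in real code it would subclass the TensorFlow hook base class instead. The batch-level mapping shown is one possible choice, not the only one.

```python
class RecordingCallback:
    """Hypothetical stand-in exposing Keras-style callback method names."""
    def __init__(self):
        self.events = []

    def on_train_begin(self, logs=None):
        self.events.append("train_begin")

    def on_batch_begin(self, batch, logs=None):
        self.events.append(("batch_begin", batch))

    def on_batch_end(self, batch, logs=None):
        self.events.append(("batch_end", batch))

    def on_train_end(self, logs=None):
        self.events.append("train_end")


class CallbackHook:
    """Adapter: hook-style method names on the outside, callback calls inside."""
    def __init__(self, callback):
        self.callback = callback
        self.step = 0

    def begin(self):
        self.callback.on_train_begin()

    def before_run(self, run_context):
        self.callback.on_batch_begin(self.step)

    def after_run(self, run_context, run_values):
        self.callback.on_batch_end(self.step)
        self.step += 1

    def end(self, session):
        self.callback.on_train_end()


# Simulate what a training loop would do with the hook for two steps
callback = RecordingCallback()
hook = CallbackHook(callback)
hook.begin()
for _ in range(2):
    hook.before_run(run_context=None)
    hook.after_run(run_context=None, run_values=None)
hook.end(session=None)
```

Epoch-level methods like on_epoch_end need extra bookkeeping (steps per epoch), since the hook interface only knows about individual runs.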

The bigger point is that this headache is because of TensorFlow. Both because TF chose a really silly class design for the SessionRunHooks thing, making automatic conversion from Keras (which has the more established set of pre-existing callbacks) harder for no good reason, and also because TensorFlow lacks functionality that Keras gives you for free.

For orchestration, my team just uses a simple GPU cluster where the native device placement primitives with TensorFlow allow us to scale to as many GPUs as we've needed (max in the dozens).

For distributing and orchestrating over larger clusters, Keras provides some good alternatives right on its own FAQ page:

< https://keras.io/why-use-keras/#keras-has-strong-multi-gpu-s... >

In the end, I would not claim you can immediately translate every complex feature of Keras, like deep custom callbacks or something, over to TensorFlow ... but that's usually not a big deal. Most times, you just want to port a fairly standard model to the Estimator API, and for this, it "just works" directly and is easy to use for local, small-ish clusters of GPUs.

When you have a much rarer problem that needs a huge GPU cluster, then use the other suggestions like dist-keras or Horovod, or write your own simple map-reduce-ish wrapper to put data on different nodes and deploy e.g. a containerized training application.

Also people need to definitely keep in mind that most of the limitations are TensorFlow's own fault for not designing things to be compatible with heavily used Keras features like Callbacks out of the box. TensorFlow has a history of doing this, and has been very developer-unfriendly in this way even when it has no downside or impact on performance or anything. The core TensorFlow designs suffer from an unfortunate "not invented here" kind of philosophy, even when dealing with Keras.

> Instead, you can just wrap the DataGenerator in a simple function that lazily outputs the next batch of training examples.

You probably know that Keras doesn't recommend those simple generators; keras.utils.Sequence is preferred instead, due to (Keras doc): "Sequence are a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch which is not the case with generators."
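
To show where that guarantee comes from, here is a plain-Python sketch of the indexable-batches idea. `BatchSequence` is a hypothetical class that mimics the `__len__`/`__getitem__` interface without subclassing anything from Keras, and the two "workers" are simulated sequentially.

```python
import math


class BatchSequence:
    """Batches addressed by index, so workers can split an epoch without overlap."""
    def __init__(self, samples, batch_size):
        self.samples = samples
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch; the last one may be smaller
        return math.ceil(len(self.samples) / self.batch_size)

    def __getitem__(self, index):
        start = index * self.batch_size
        return self.samples[start:start + self.batch_size]


seq = BatchSequence(list(range(10)), batch_size=4)

# Two simulated workers take disjoint batch indices for one epoch
worker_a = [seq[i] for i in range(0, len(seq), 2)]
worker_b = [seq[i] for i in range(1, len(seq), 2)]
seen = sorted(x for batch in worker_a + worker_b for x in batch)
```

With a bare generator there is no index to hand out, so copies of it in several processes can yield the same samples more than once.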

I couldn't see any equivalent of this for estimators, sadly, and wrapping it up in a naive generator seemed like a functionality downgrade.

> because this is part of the compiled Keras model, before ever converting anything to TensorFlow Estimator

Right, you specify a loss function before compiling, however if it is a custom one and you for some reason need to reload a model snapshot (i.e. resuming training), you need to provide it separately or the loading fails. I haven't found any docs on this. Imagine your training optimizer automatically generating loss functions by means of function composition, e.g. you put a mix of +-*/,log,exp,tanh etc. based off some past training experience of what helped in individual cases/literature, then taking 1000s of these loss functions and pushing them to a large cluster where they are scored on how well they performed, keeping only the best performing ones.
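
A small sketch of the composition part plus a name registry, in plain Python. `compose`, `register` and `CUSTOM_LOSSES` are hypothetical helpers; the idea is that every generated loss gets a stable name, so the same name-to-function mapping can be handed back when a snapshot is reloaded (as far as I know, Keras' load_model takes a custom_objects dictionary for exactly this).

```python
import math


def compose(outer, inner):
    """Build a loss by applying `outer` to the result of `inner(y_true, y_pred)`."""
    return lambda y_true, y_pred: outer(inner(y_true, y_pred))


def squared_error(y_true, y_pred):
    return (y_true - y_pred) ** 2


# Hypothetical registry: name -> generated loss function
CUSTOM_LOSSES = {}


def register(name, loss_fn):
    CUSTOM_LOSSES[name] = loss_fn
    return loss_fn


# One generated candidate: log(1 + squared error)
register("log1p_squared_error", compose(math.log1p, squared_error))

loss = CUSTOM_LOSSES["log1p_squared_error"]
value = loss(3.0, 1.0)  # log1p((3 - 1) ** 2)
```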

Class weights are specified in fit_generator(), not at compile time; again, here I couldn't find any description of how to convert Keras' weight dictionary to what TensorFlow needs.

> Callbacks... "Penalizing" Keras because TensorFlow offers less functionality doesn't seem right.

The thing here is that some of those callbacks are mandatory for a training to converge, e.g. decreasing learning rate, escaping plateau situations, computing various stats that aren't provided by Keras (outside loss/accuracy; you might want F1, Fleiss/Cohen's Kappa, Matthews correlation coefficient, AUC ROC etc.) that might be decisive for keeping/discarding a model; then also multi GPU callbacks; some people even use callbacks to perform the whole distributed computation as well. In my examples, if I remove any of those callbacks, my models won't achieve any kind of usability but with those callbacks I match world-class results. I couldn't find any non-insane way to map them to TensorFlow prior to our conversation.
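
For reference, the kind of statistic such a callback would compute at the end of an epoch, sketched in plain Python for the binary case. The function names and the toy labels are made up for illustration; this is the standard F1 and Matthews correlation coefficient arithmetic, not code from Keras.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, tn, fp, fn):
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Toy validation labels and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
f1 = f1_score(*counts)
mcc = matthews_corrcoef(*counts)
```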

As I mentioned, I have a very large cluster, each node with multiple GPUs, so I need an orchestration on both hyperparameters/loss functions per node as well as within each node to run on multiple GPUs.

The page from Keras you mentioned was precisely my starting point and from those tf.Estimator seemed the least devops-intense way to go (Horovod needs MPI and CERNDB/Keras Spark).

I'll take a deeper look into SessionRunHook you mentioned - thanks! ;-)

For class weights, the easiest thing is to just generate that as another one of the items placed into the input function dictionary, e.g. when you wrap the DataGenerator. Then have a custom loss function that takes this input element and applies the weight for that training sample. Again, the need to do slight extra work is a limitation of TensorFlow here, not of Keras, but because Keras is so flexible, it's super easy to work around.
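
Sketching that out in plain Python: the weight travels with each sample as an extra named input, and the loss reads it back. `add_sample_weights`, `weighted_squared_error` and the dictionary keys are hypothetical names, and squared error is used only to keep the arithmetic easy to follow.

```python
def add_sample_weights(features, labels, class_weight):
    """Attach a per-sample weight (looked up by class) to the named inputs."""
    weights = [class_weight[label] for label in labels]
    return {"features": features, "sample_weight": weights}, labels


def weighted_squared_error(inputs, labels, predictions):
    """Weighted mean of squared errors, using the weights carried in `inputs`."""
    weights = inputs["sample_weight"]
    total = sum(w * (y - p) ** 2
                for w, y, p in zip(weights, labels, predictions))
    return total / sum(weights)


# Class 0 is rarer here, so it gets twice the weight
inputs, labels = add_sample_weights(
    features=[[0.1], [0.9], [0.8]],
    labels=[0, 1, 1],
    class_weight={0: 2.0, 1: 1.0},
)
loss = weighted_squared_error(inputs, labels, predictions=[0.5, 0.5, 1.0])
```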

> "computing various stats that aren't provided by Keras.."

It seems like you have this backward. Keras provides the easy interface to create the custom callbacks. That's why you can create extra convergence metrics, etc., that are far harder to use if implementing in pure TensorFlow. The part where TensorFlow is specifically lacking functionality is in its ability to handle these callbacks (both pre-built in Keras or user-defined). I've had good success with the solution I mentioned with SessionRunHooks, but still, it is a terrible design choice by the TensorFlow people to create this in a way that is not directly compatible with all the work Keras had done.

> "from those tf.Estimator seemed the least devops-intense way to go (Horovod needs MPI and CERNDB/Keras Spark)."

Just based on how poorly designed the tf.Estimator API is though, I'm not actually sure the other methods would require less devops or less investment. In some cases for standard models, yes. But if you've already committed to using Keras for very customized situations, then going back to the dark ages with native TensorFlow will often be much more work and more error prone than using the other solutions. The Horovod dependence on MPI in particular is fairly simple and needs little management. Most people having done ML / stats PhDs will already have managed far more difficult situations with MPI previously anyway, or at least have the Linux skills needed. The point is you have a fighting chance, whereas deciphering undocumented and badly designed corners of TensorFlow often leaves you with no fighting chance.

Funny, I dumped TensorFlow a few years ago because it wasn't possible to do bleeding edge stuff.

Yeah, for many bleeding edge things you still need PyTorch ;-)

Maybe you have a chance to provide an example?

In my opinion: Getting started with TensorFlow, and having a model designed within a day and training within another is also possible. This mostly depends on your model and your data, and (imho) not on the framework of choice.

For all of Keras/PyTorch/Tensorflow, you'll need to learn the API - but if you have any ML background, that should be straightforward.

Yes, for all practical problems data is the biggest challenge.

Though, debugging matters. In TF it is easy to get errors and spend a lot of time searching for them. In PyTorch it is straightforward. It matters the most when the network, or cost function, is not standard (think: YOLO architecture).

E.g. when I wanted to write a differentiable decision tree, it took me way longer in TF (which I already knew) than in PyTorch, with its tutorial open in another pane.

TensorFlow needs some Deep Learning-based assistant to identify common causes of errors you might see on the AST level. Cryptic errors are its weakness and an AI trained to spot correlations between Python AST and error might be very helpful.

Tensorflow also does not work on my 2010-or-so CPU without building it yourself. So I ended up trying PyTorch and I am glad I did. I like it better than esoteric errors.

I used Keras a few years ago, and really liked it and still recommend it to people, even contributing some code back to it, but I don't think it obviated the need to know TF, since eventually you want to do something that's not in the Keras toolbox, and then you need to understand what it's doing under the hood.

> I was also told that doing it the real way using Tensorflow would be the way to go and I agree with that sentiment if my problem was Google scale which it wasn't. In fact I would argue that most workloads around the world are not Google scale and neither are most Google workloads.

You can convert the Keras model to TF pretty easily if you need to, as long as you use the TF backend. I did this, and converted the string preprocessing in TF so the model could be used in TF serving taking only the string as input.

Did you look at Tensorflow Estimators? They are a new high-level API with built-in support for distributed training.

https://www.tensorflow.org/programmers_guide/estimators

Yes, they're pretty ugly TBH. All they've done is provide some decent "canned" estimators but for anything custom you're still using the base tensorflow API. Not to mention feeding in something like numpy arrays > 2GB is a huge pain (their Dataset API doesn't fully work).

Interesting. So do you recommend Keras+TF as well, or drop Tensorflow altogether?

Keras + MXNet is far better. It's faster, you get multi-gpu out of the box, and it's reaaaally easy to export to ONNX for fast inference elsewhere.