by rkaplan 3393 days ago
"In contrast to deep neural networks which require great effort in hyper-parameter tuning, gcForest is much easier to train."

Hyperparameter tuning is not as much of an issue with deep neural networks anymore. Thanks to BatchNorm and more robust optimization algorithms, most of the time you can simply use Adam with a default learning rate of 0.001 and do pretty well. Dropout is not even necessary with many models that use BatchNorm nowadays, so generally tuning there is not an issue either. Many layers of 3x3 conv with stride 1 is still magical.
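
For reference, a minimal sketch of the Adam update with the defaults mentioned above (learning rate 0.001, plus the usual beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8). The toy quadratic objective and the helper name `adam_minimize` are assumptions made for illustration; nothing here comes from a specific framework.

```python
import math

def adam_minimize(grad, x0, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on a single scalar parameter; grad(x) returns the gradient at x."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # running average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = (x - 1)^2, whose gradient is 2 * (x - 1).
x_final = adam_minimize(lambda x: 2.0 * (x - 1.0), x0=0.0, steps=5000)
```

Because the step is normalised by the gradient's running magnitude, each update moves the parameter by roughly `lr` at most, which is part of why a single default learning rate transfers across many problems.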

Basically: deep NNs can work pretty well with little to no tuning these days. The defaults just work.

5 comments

I couldn't disagree more. The defaults don't just work, and the architecture of the network could also be considered a hyperparameter, in which case what would be a reasonable default for all the types of problems ANNs are used for?
Are you using batch normalization? If you are, an issue I see all the time is folks not setting the EMA filter coef correctly. In Keras, it defaults to something like 0.99, which in my mind makes no sense. I use something around 0.6 and life is good. You want to get an overall good measurement of the statistics, and in my mind the frequency cutoff when coef=0.99 is just way too high for most applications. You usually want something that filters out just about everything except very close to DC.
The response to "the defaults should work just fine without any hyperparameter tuning" is "try fiddling with the EMA filter coefficient hyperparameter" ?

(Just poking fun. :P)

It's like the joke of the mathematician giving an exposition of a complex proof. At one point he says "It is obvious that X", pauses, scratches his head, does a few calculations. He leaves the room for twenty minutes and returns. Then he continues, "it is obvious that X", and goes to the next step.

Deep in the field, it's fine for machine learning experts to say "everything just works" [if you've mastered X, Y, Q esoteric fields and tuning methods] since they're welcome to "humble brag" as much as they want. But when this gets in the way of figuring out what really "just works" it's more of a problem.

Interesting, totally new concept for me: Where can I read more about EMA filter coefficient in Keras? My Google-fu is failing.
I think they're referring to the momentum parameter at [1]. The exponential moving average (EMA) of the batch mean/variance is used in the batch normalizing transform (Algorithm 1 in [2]).

The momentum ranges from 0 to 1. If it's close to 1, which the default of 0.99 is, the EMA of the batch mean/variance will change slowly across batches. If it's close to 0, the EMA will be close to the mean/variance of the current batch.
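
A minimal sketch of that update, assuming the Keras-style convention `moving = momentum * moving + (1 - momentum) * batch_value`. The helper `ema_track` and the constant input sequence are illustrative, not part of Keras.

```python
def ema_track(batch_values, momentum, initial=0.0):
    """Return the running EMA after each batch (Keras-style momentum convention)."""
    moving = initial
    history = []
    for value in batch_values:
        moving = momentum * moving + (1.0 - momentum) * value
        history.append(moving)
    return history

# The running estimate starts at 0 while every batch reports a mean of 1.
batch_means = [1.0] * 20
slow = ema_track(batch_means, momentum=0.99)  # moves only a little per batch
fast = ema_track(batch_means, momentum=0.6)   # follows the current batch closely
```

After n identical batches the estimate equals `1 - momentum ** n`, so with momentum 0.99 it is still far from the true value after 20 batches, while with 0.6 it has essentially arrived.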

The EMA acts as a low-pass filter. With a momentum close to 1, the EMA changes slowly, filtering out high frequencies and leaving only frequencies close to DC. Note that this is opposite to what grandparent says: 0.99 has a lower frequency cutoff than 0.6 does. So I'm not really sure what they're getting at there.
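
To put numbers on the comparison, here is a hedged sketch that treats the EMA as the first-order filter y[n] = m*y[n-1] + (1-m)*x[n] and solves |H|^2 = 1/2 for the -3 dB cutoff. The closed form is derived for this sketch rather than quoted from the thread, and `ema_cutoff` is a hypothetical helper name.

```python
import math

def ema_cutoff(momentum):
    """-3 dB cutoff in radians/sample of y[n] = m*y[n-1] + (1-m)*x[n]."""
    m = momentum
    # |H(w)|^2 = (1-m)^2 / (1 - 2*m*cos(w) + m^2); set to 1/2 and solve for cos(w).
    cos_w = (4.0 * m - 1.0 - m * m) / (2.0 * m)
    if not -1.0 <= cos_w <= 1.0:
        raise ValueError("this momentum has no -3 dB point below Nyquist")
    return math.acos(cos_w)

cutoff_099 = ema_cutoff(0.99)
cutoff_06 = ema_cutoff(0.6)
```

Under these assumptions the cutoff for momentum 0.99 comes out roughly fifty times lower than for 0.6, which is consistent with the point that a momentum near 1 keeps only content close to DC.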

[1] https://keras.io/layers/normalization/#batchnormalization

[2] https://arxiv.org/abs/1502.03167

When working with images, do you use mode 0, 1, or 2?
They work well, just that you need a lot of patience (and know-how) to work with them. Also GPUs are expensive. By the time you realize that you messed up you have wasted a lot of time. Of course this is true with any ML algorithm out there. But what I'm trying to say is it is possible that an as yet unknown method exists that may be less computationally complex.

One of the problems I see is that people abuse deep neural networks no end. One doesn't need to train a deep NN for recognizing structured objects like a coke can in a fridge. Simple HOG/SIFT/other feature engineering may be a faster and better bet for small-scale object recognition. However, expecting SIFT to outperform a deep neural net on ImageNet is out of the question. Thus when it comes to deploying systems in a short frame of time one should keep an open mind.

> One doesn't need to train a deep nn for recognizing structured objects like a coke can in a fridge.

I disagree. Sure, you don't need a NN to recognize one Coke can in one fridge for your toy robot project. If you want to recognize all Coke cans in all fridges, for your real-world, consumer-ready Coke-fetching robot product? You're going to need a huge dataset of all the various designs of Coke cans out there, in all the different kinds of refrigerators, and your toy feature engineered approach is going to lose to a NN on that kind of varied dataset.

Which is why you should do stereo or SfM, make a 3d reconstruction, and then do HOG or some 3D feature to recognise the coke can.

Trying to do it from images with a NN that doesn't comprehend 3D space is just silly.

I'm not sure if you're serious or throwing some very excellent shade.
I'm serious. If you have to rely on mono, single image inputs then yeah ImageNet is going to do better. But it will also mistake every picture of a coke can as the real thing. It will be horrifically sensitive to malicious inputs. Much better would be to use 2 calibrated lenses and do 3D reconstruction. Even if you're just doing the reconstruction as a sanity check for a NN to weed out the false positives.
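
A rough sketch of that sanity check, assuming a rectified, calibrated stereo pair where depth follows Z = f * B / d (focal length f in pixels, baseline B in metres, disparity d in pixels). The function names, the depth-spread threshold and the sample numbers are all hypothetical; a real system would more likely fit a plane than use a simple range test.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo depth for a rectified pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

def looks_three_dimensional(disparities_px, focal_px, baseline_m, min_spread_m=0.02):
    """Crude test: a real can shows depth relief across the detection box,
    while a photo of a can facing the camera is nearly flat."""
    depths = [depth_from_disparity(d, focal_px, baseline_m) for d in disparities_px]
    return (max(depths) - min(depths)) >= min_spread_m

# Hypothetical rig: 700 px focal length, 10 cm baseline, samples across one detection.
flat_poster = [70.0, 70.0, 70.1, 70.0]  # almost constant disparity
real_can = [70.0, 71.5, 72.0, 71.5]     # the middle of the can is closer
```

A detection from the NN would then be kept only if `looks_three_dimensional(...)` also passes, which is the "weed out the false positives" role described above.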
Errm, hang on, are you saying that if you have a task of classifying unseen images given a labelled training set you should get a stereo camera or video camera and create another problem?

Which you can solve?

Because the problem is silly?

What if I say : "I will give you $10m to solve it, and if you fail, I will kill this very kind old monkey?"

Object recognition doesn't only exist in the subspace of labelled 2D images. It tends to be derived from a 3D space, which is a whole extra orthogonal data source that the "NN all the things" crowd is fastidiously ignoring.

Why, I'm not sure, but I'm guessing because it is hard/inaccurate to do with just NNs and parameter/network architecture tweaking. Possibly also because benchmarks with single mono images are much easier to make.

Just because it is hard with method A, and is harder to make benchmarks, doesn't mean method B isn't better.

Yes, but am I missing something when I say that, if the problem is to deal with labelled 2D images, declaring that you should be working with 3D images or short video sequences doesn't help?

Sure, if you are building a Robot and I say "use this camera and a deep network" and you say "It'll work better with stereo" well... yes super do that!

But if we are working with mono images I don't understand how the observation helps?

But an NN can completely mess up when a new refrigerator is used that wasn't part of the training set.

Also, the training is very asymmetric, since there are many more things NOT coke cans than there are coke cans.

> But an NN can completely mess up when a new refrigerator is used that wasn't part of the training set

Not if your training set is representative. And this is just as true of feature engineered approaches, the only difference is that dealing with real world variation requires a lot less work with NNs because once you add the variation to your dataset you're done. With feature engineering that's only the first step because now you have to figure out where the new variation is breaking your features and how to modify them to fix it.

"Not if your training set is representative."

And herein lies a prominent failure mode of a huge amount of this sort of work that I've seen - hard to just "add the variation to your dataset" when your data set is one or more orders of magnitude too small to contain it. At that point all that remains is the handwaving.

The right response to insufficient data is usually simplifying the modeling.

I agree with rkaplan. I've been working with many different visual problems and that comment is pretty consistent with what I've seen.
No batch norm for LSTMs
Totally agree, but RFs are more related to "CNN-ish problems" (image classification and...?), not RNNs, or generally, any graphical sequence model.

EDIT: to clarify: "j/k" with the thing in parenthesis ;-)

GAN training is still spooky mysterious and can easily fail in nonintuitive ways.

Sometimes GANs converge or not depending on the random number seed, even with the same hyperparameters.

I'm not sure about that. The new GAN models over the past 2-3 months, like LS-GAN or WGAN, all seem to train much more stably. I've beaten up on WGAN with all sorts of strange tweaks and hyperparameter settings and while it may not work well, it's never catastrophically diverged on me the way DCGAN would at the drop of a hat.
Have you found any good ways to speed it up? The five-fold training on the Critic is very expensive.
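
For context, the "five-fold" cost comes from the WGAN schedule of several critic updates per generator update (n_critic = 5 in the original algorithm). A skeleton of that loop follows, with the actual update steps left as hypothetical callbacks supplied by the caller; weight clipping and the optimizer are omitted.

```python
def train_wgan(num_generator_steps, critic_step, generator_step, n_critic=5):
    """Alternate n_critic critic updates with one generator update.
    critic_step and generator_step are caller-supplied callables."""
    for _ in range(num_generator_steps):
        for _ in range(n_critic):
            critic_step()
        generator_step()

# Stand-in callbacks that only count calls, to show the cost ratio.
counts = {"critic": 0, "generator": 0}

def fake_critic_step():
    counts["critic"] += 1

def fake_generator_step():
    counts["generator"] += 1

train_wgan(100, fake_critic_step, fake_generator_step)
```

Lowering `n_critic` is the obvious speed knob in this sketch, though whether training stays stable with fewer critic updates is exactly the open question in this subthread.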
No, not yet. I suspect that increasing the discriminator-only learning rate might help but haven't tried.
Try removing BN from the critic :)