Hacker News
by gamegoblin 4348 days ago
A note on dropout:

If your layer size is relatively small (not hundreds or thousands of nodes), dropout is usually detrimental, and a more traditional regularization method such as weight decay is superior.

For networks of the size Hinton et al. are working with nowadays (thousands of nodes in a layer), dropout is good, though.
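
For anyone who hasn't seen the two regularizers side by side, here is a minimal sketch in plain Python. The function names, the "inverted" rescaling in `dropout`, and the default hyperparameters are my own assumptions for illustration, not something taken from a particular library or from Hinton et al.'s code.

```python
import random


def dropout(activations, p_drop=0.5, rng=None):
    """Inverted dropout (assumed variant): zero each unit with probability
    p_drop and rescale the survivors so the expected activation is unchanged."""
    rng = rng or random.Random()
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def sgd_step_with_weight_decay(weights, grads, lr=0.1, decay=1e-4):
    """One SGD step with an L2 penalty: w <- w - lr * (g + decay * w)."""
    return [w - lr * (g + decay * w) for w, g in zip(weights, grads)]


if __name__ == "__main__":
    hidden = [1.0] * 16
    print(dropout(hidden, p_drop=0.5, rng=random.Random(0)))
    print(sgd_step_with_weight_decay([1.0, -2.0], [0.0, 0.0], lr=0.1, decay=0.5))
```

Dropout injects noise into the activations at training time, while weight decay just shrinks every weight a little on each step, which is why the latter does not depend on having redundant units.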


I've found a combination of the two to be great. Most deep networks (even just the feed-forward variety) tend to generalize better when trained on mini-batches with random dropout over multiple epochs. This is true of both the image and word vector representations I've worked with.
I've found that with a large enough network, using the two together is good, but as your network grows smaller and you lose redundancy, dropout starts to hurt you compared with using weight decay alone.

In huge networks where you have a lot of non-independent feature detectors, your network can tolerate having ~50% of them dropped out and then improves when you use them all at once. But in small networks where you have mostly independent features (at least in some layer), using dropout can cause the feature detectors to thrash and fail to properly stabilize.

Consider a 32-16-10 feedforward network with binary stochastic units. If all 10 output bits are independent of each other and you apply 50% dropout to the 16-unit hidden layer, the expected number of active hidden nodes is 8. That is fewer than the 10 independent bits you need to carry, so you lose information without any hope of getting it back.
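
The arithmetic can be checked with a quick binomial calculation. This is only a sketch: the function name is made up, and it assumes 50% dropout plus the simplification that each of the 10 independent output bits needs its own active hidden unit.

```python
from math import comb


def active_unit_stats(n_hidden=16, p_drop=0.5, n_needed=10):
    """Return (expected active hidden units, P(fewer than n_needed are active)),
    treating each unit as dropped independently with probability p_drop."""
    keep = 1.0 - p_drop
    expected = n_hidden * keep
    p_short = sum(
        comb(n_hidden, k) * keep**k * p_drop ** (n_hidden - k)
        for k in range(n_needed)
    )
    return expected, p_short


if __name__ == "__main__":
    expected, p_short = active_unit_stats()
    print(f"expected active hidden units: {expected}")
    print(f"P(fewer than 10 active): {p_short:.3f}")
```

Under those assumptions, roughly three out of four dropout masks leave fewer than 10 hidden units active, so most training steps see a hidden layer that is too narrow for the output.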

Definitely agreed. The networks I'm typically dealing with are bigger. I would definitely say the feature space needs to be large enough to get good results.

That being said, most problems nowadays (at least for my customers) involve larger numbers of parameters anyway.