by agibsonccc
4348 days ago
I've found a combination of the two to work well. Most deep networks (even the plain feed-forward variety) tend to generalize better when trained on mini-batches with random dropout over multiple epochs. This has held for both the image and word-vector representations I've worked with.
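To make the schedule concrete, here is a minimal sketch of mini-batch training with a fresh random dropout mask per batch over several epochs. The function names, the batch size, the 16-unit layer, and the 0.5 drop probability are all assumptions for illustration, and the actual forward/backward step is left out.

```python
import random


def dropout_mask(n_units, drop_prob, rng):
    """Sample a 0/1 keep-mask; each unit is dropped with probability drop_prob."""
    return [0 if rng.random() < drop_prob else 1 for _ in range(n_units)]


def minibatches(samples, batch_size, rng):
    """Yield shuffled mini-batches that cover every sample once per epoch."""
    order = list(samples)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


def collect_masks(n_samples=100, batch_size=10, n_hidden=16,
                  drop_prob=0.5, epochs=3, seed=0):
    """Return the dropout mask drawn for every mini-batch of every epoch."""
    rng = random.Random(seed)
    masks = []
    for _ in range(epochs):
        for batch in minibatches(range(n_samples), batch_size, rng):
            # A real training step would run forward/backward on `batch`
            # with the hidden units multiplied by this mask.
            masks.append(dropout_mask(n_hidden, drop_prob, rng))
    return masks
```

Because the mask is resampled for every batch, each epoch sees the same data through a different set of thinned networks, which is roughly where the regularization effect is thought to come from.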
In huge networks with many non-independent feature detectors, the network can tolerate having ~50% of them dropped out, and it then improves when you use them all at once. But in small networks where the features are mostly independent (at least in some layer), dropout can cause the feature detectors to thrash and fail to stabilize properly.
Consider a 32-16-10 feedforward network with binary stochastic units. If all 10 output bits are independent of each other and you apply dropout to the hidden layer, the expected number of active hidden nodes is 8, so you lose information (since the output bits are independent of each other) without any hope of getting it back.
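The arithmetic behind that example can be checked with a small sketch. It assumes a drop probability of 0.5 (matching the ~50% figure above) and uses the rough bound that one binary unit carries at most one bit per sample; the names are made up for illustration.

```python
import random


def expected_active(n_hidden, drop_prob):
    """Expected number of hidden units that survive dropout."""
    return n_hidden * (1.0 - drop_prob)


def simulate_active(n_hidden, drop_prob, trials, seed=0):
    """Estimate the mean number of surviving units by sampling masks."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += sum(1 for _ in range(n_hidden) if rng.random() >= drop_prob)
    return total / trials


N_HIDDEN = 16       # hidden layer of the 32-16-10 network
N_OUTPUT_BITS = 10  # independent output bits
DROP_PROB = 0.5     # assumed drop probability

mean_active = expected_active(N_HIDDEN, DROP_PROB)

# Rough capacity bound: k active binary units carry at most k bits,
# so on average the thinned hidden layer falls short of the outputs.
bit_shortfall = N_OUTPUT_BITS - mean_active
```

With these numbers the thinned hidden layer averages 8 active units against 10 independent output bits, so on a typical training pass at least 2 bits cannot get through, however the weights are set.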