| HN Mirror

I've found that with a large enough network, using the two together is good, but as your network grows smaller and you lose redundancy, dropout starts to hurt you when compared with using weight-decay alone.

In huge networks in which you have a lot of non-independent feature detectors, your network can tolerate to have ~50% of them dropped out and then improves when you use them all at once, but in small networks when you have a mostly independent features (at least in some layer), using dropout can cause the feature detectors to trash a fail to properly stabilize.

Consider a 32-16-10 feedforward network with binary stochastic units. If all 10 output bits are independent of each other, and you apply dropout to the hidden layer, your expected number of nodes in 8, so you lose information (since the output bits are independent of each other) without any hope of getting it back.