Hacker News new | ask | show | jobs
by eli_gottlieb 3593 days ago
>We can now train deep networks because we learned how to regularize - before it was impossible because of vanishing gradients.

Those are two different things. Vanishing gradient problems were ameliorated by switching from sigmoidal activation functions to rectified linear units or tanh activations, and also by dramatically reducing the amount of edges through which gradients propagate. The latter was accomplished through massive regularization to reduce the size of the parameter spaces: convolutional layers and dropout.

Stuff like residuals are still being invented in order to further banish gradient instability.