by bsfjgngdnxy
3500 days ago
>MXNet can consume as little as 4 GB of memory when serving deep networks with as many as 1000 layers. So perhaps I'm not well versed enough in deep learning, but does this mean that they solved the vanishing gradient problem? How are they managing to do this?
This is kind of related to solving the vanishing gradient issue in RNNs by using additive recurrent architectures like LSTMs and GRUs.
Alternatively, it's possible to use concatenative skip connections, as in DenseNets: https://arxiv.org/abs/1608.06993
Still, using 1000 layers is useless in practice. State-of-the-art image classification models are in the range of 30-100 layers, with residual connections and a number of channels per layer that varies with depth so as to keep the total number of trainable parameters tractable. The 1000-layer nets are just interesting as a memory scalability benchmark for DL frameworks, and as an empirical check that the optimization problem is feasible at that depth, but are of no practical use otherwise (as far as I know).
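To make the additive vs. concatenative distinction concrete, here is a minimal NumPy sketch. The function names (`layer`, `residual_block`, `dense_block`), the toy tanh layer and the shapes are all made up for illustration; this is not MXNet's API or the exact blocks from the ResNet or DenseNet papers.

```python
import numpy as np

def layer(x, w):
    """Toy layer (assumption for illustration): linear map followed by tanh."""
    return np.tanh(x @ w)

def residual_block(x, w):
    """Additive skip: the input is added to the layer output, so the width is unchanged."""
    return x + layer(x, w)

def dense_block(x, w):
    """Concatenative skip: the input is kept alongside the layer output, so the width grows."""
    return np.concatenate([x, layer(x, w)], axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))               # batch of 2, 8 features
w_res = rng.normal(size=(8, 8)) * 0.1     # must map 8 -> 8 so the sum is well defined
w_dense = rng.normal(size=(8, 4)) * 0.1   # maps 8 -> 4 new features

res_out = residual_block(x, w_res)        # shape (2, 8)
dense_out = dense_block(x, w_dense)       # shape (2, 12)
```

In both cases the input reaches the output through an identity path, which is what gives the gradient a route that does not pass through the nonlinearity. The difference is that the additive version keeps the feature width fixed, while the concatenative version grows it with every block.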