by gwern
243 days ago
> Note again that a residual connection is not just an arbitrary shortcut connection or skip connection (e.g., 1988)[LA88][SEG1-3] from one layer to another! No, its weight must be 1.0, like in the 1997 LSTM, or in the 1999 initialized LSTM, or the initialized Highway Net, or the ResNet. If the weight had some other arbitrary real value far from 1.0, then the vanishing/exploding gradient problem[VAN1] would raise its ugly head, unless it was under control by an initially open gate that learns when to keep or temporarily remove the connection's residual property, like in the 1999 initialized LSTM, or the initialized Highway Net.

After reading Lang & Witbrock 1988 https://gwern.net/doc/ai/nn/fully-connected/1988-lang.pdf, I'm not sure how convincing I find this explanation.
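
For reference, the arithmetic behind the quoted claim can be sketched in a few lines. This is only an idealized illustration, not anything taken from the quoted text or from Lang & Witbrock: it assumes the gradient flows through a chain of purely linear skip connections that all share one fixed scalar weight, and it ignores the nonlinear branches entirely. The function name `skip_path_gradient` and the chosen depth of 100 are made up for the example.

```python
def skip_path_gradient(weight: float, depth: int) -> float:
    """Gradient factor along `depth` stacked linear skip connections.

    Assumption for illustration: every skip connection multiplies its
    input by the same scalar `weight`, and the nonlinear branches are
    ignored, so the chain rule gives a product of `depth` equal factors.
    """
    grad = 1.0
    for _ in range(depth):
        grad *= weight  # each layer contributes one factor of `weight`
    return grad


if __name__ == "__main__":
    # Compare a weight slightly below 1.0, exactly 1.0, and slightly above.
    for w in (0.9, 1.0, 1.1):
        print(f"weight={w}: gradient factor after 100 layers = "
              f"{skip_path_gradient(w, 100):.3e}")
```

Under these assumptions a weight of exactly 1.0 leaves the factor at 1.0, while 0.9 shrinks it to the order of 1e-5 and 1.1 grows it to the order of 1e4 after 100 layers. Whether that simplified picture settles the question of which skip connections count as "residual" is a separate matter.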