by sdenton4
931 days ago
Yes, there are very good theoretical reasons for skip connections. If your initial matrix M is noise centered at 0, then 1+M is a noisy identity operation, while 0+M is a noisy deletion... It's better to do nothing if you don't know what to do, and avoid destroying information. I appreciate the sibling comment's perspective that memory pressure is a problem, but that can be mitigated by using fewer/longer skip connections across blocks of layers.
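A quick way to see the noisy-identity vs. noisy-deletion point is a rough NumPy sketch. The dimension, noise scale, and the function name `relative_change` are all made up for illustration, and a single linear map stands in for a whole layer:

```python
import numpy as np

def relative_change(dim=256, scale=0.01, seed=0):
    """Compare how far the output lands from the input, with and without a skip.

    dim, scale, and seed are illustrative assumptions, not values from any
    particular model.
    """
    rng = np.random.default_rng(seed)
    # "Initial matrix M is noise centered at 0"
    M = rng.normal(0.0, scale, size=(dim, dim))
    x = rng.normal(size=dim)

    with_skip = x + M @ x      # (1 + M) x: noisy identity
    without_skip = M @ x       # (0 + M) x: noisy deletion

    norm_x = np.linalg.norm(x)
    err_skip = np.linalg.norm(with_skip - x) / norm_x
    err_plain = np.linalg.norm(without_skip - x) / norm_x
    return err_skip, err_plain

err_skip, err_plain = relative_change()
print(f"relative distance from input, with skip:    {err_skip:.3f}")
print(f"relative distance from input, without skip: {err_plain:.3f}")
```

With a small noise scale, the skip version should stay close to the input, while the plain version ends up roughly as far from the input as the input is from zero, i.e. almost none of the original signal survives.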