|
|
|
|
|
by visarga
88 days ago
|
|
> there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n There is something that does exactly that - the residual connections. Each layer adds a delta to it, but that means they share a common space. There are papers showing the correlation across layers, of course it is not uniform across depth, but consecutive layers tend to be correlated. |
|