Hacker News new | ask | show | jobs
by visarga 88 days ago
> there is nothing about the training process of these models that would encourage them to make the output of any layer apart from (n-1) meaningful as the input of layer n

There is something that does exactly that - the residual connections. Each layer adds a delta to it, but that means they share a common space. There are papers showing the correlation across layers, of course it is not uniform across depth, but consecutive layers tend to be correlated.