by kylevedder
1441 days ago
Probably the most interesting trick from the paper is using the head as a soft supervisor for earlier layers of the network. The intuition is that if the earlier layers learn to imitate the higher-capacity later layers, the later layers are freed up to better learn the residual, and the network as a whole gets a denser supervisory signal.
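
For anyone who wants to see the shape of the idea, here is a rough sketch of what such a loss could look like. This is my own illustration, not the paper's code: I am assuming auxiliary classifiers attached to intermediate layers, and the names `aux_logits`, `alpha`, and `temperature` are hypothetical. The final head's softened output is treated as a fixed target (in a real framework you would stop gradients through it), and each earlier classifier is trained on a mix of the hard label and that soft target.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Hard-label cross-entropy for a single example."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def self_distillation_loss(aux_logits, head_logits, label,
                           alpha=0.5, temperature=2.0):
    """Loss for one example (illustrative, not the paper's exact form).

    aux_logits:  list of logit vectors from hypothetical auxiliary
                 classifiers on earlier layers.
    head_logits: logits from the final head, used as the soft teacher.
    alpha:       assumed weight on the soft (imitation) term.
    """
    # Soft target from the head; treated as a constant here.
    teacher = softmax(head_logits, temperature)

    # The head itself is trained on the ground-truth label only.
    loss = cross_entropy(head_logits, label)

    # Each earlier classifier gets the label plus the head's soft output.
    for logits in aux_logits:
        student = softmax(logits, temperature)
        hard = cross_entropy(logits, label)
        soft = kl_divergence(teacher, student)
        loss += (1 - alpha) * hard + alpha * soft
    return loss
```

The soft term is zero when an earlier classifier already matches the head, and it grows as the two disagree, which is where the extra per-layer supervision comes from. The mixing weight and temperature are knobs I made up for the sketch, so check the paper for what the authors actually use.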