by kylevedder 1441 days ago
Probably the most interesting trick in the paper is using the head as a soft supervisor for earlier layers of the network. The intuition is that if the earlier layers learn to imitate the higher-capacity later layers, the later layers are freed up to better learn the residual, and the network gets a denser supervisory signal.
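To make the idea concrete, here is a rough sketch of how I read it, not the paper's actual implementation. It assumes a plain classification setup, assumes the soft targets are just the lead head's softmax outputs, and uses hypothetical names throughout (`deep_supervision_loss`, `aux_weight`, and the 0.25 default are all mine).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_probs):
    """Cross-entropy between softmax(logits) and a target distribution."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)


def deep_supervision_loss(lead_logits, aux_logits, label, aux_weight=0.25):
    """Lead head learns from the hard label; aux head imitates the lead head.

    aux_weight is a hypothetical weighting, not a value from the paper.
    """
    num_classes = len(lead_logits)
    hard = [1.0 if i == label else 0.0 for i in range(num_classes)]

    # Soft targets come from the lead head. In an autodiff framework these
    # would be detached so the lead head is not pulled toward the aux head.
    soft = softmax(lead_logits)

    lead_loss = cross_entropy(lead_logits, hard)
    aux_loss = cross_entropy(aux_logits, soft)
    return lead_loss + aux_weight * aux_loss


if __name__ == "__main__":
    lead = [2.0, 0.5, -1.0]
    print(deep_supervision_loss(lead, [1.5, 0.7, -0.5], label=0))
```

The aux term is smallest when the earlier layer's head reproduces the lead head's distribution exactly, which is the "imitate the later layers" part of the intuition.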
1 comment

Yes, but to my surprise the "compound scaling" provides 3x more improvement in their ablation study. Also, I don't understand Table 8 in their ablation study for aux heads, specifically: why does it have different base benchmark values from Tables 6 and 7?