by vessenes
613 days ago
Yes. This looks really, really good to me: across-the-board improvements in training time, and perplexity improvements both per token trained and per model size. I'm reminded of MoE architectures; in that world we're choosing an optimal small model to process part or all of the inference job. I wonder if MoE got some of the same benefits from forcing the Transformer to distinguish between alternate possibilities. In any event, I'd imagine that this will get widely adopted if the numbers hold up; like I said, there seems to be basically no downside, and it should be easy to replicate.
https://github.com/microsoft/unilm/blob/master/Diff-Transfor...