by namibj
813 days ago
Cf. also the Universal Transformer: the same layer stacked many times.
The sparse version of that is basically MoE plus a stick-breaking mechanism, which prevents vanishing gradients while letting the model decide whether to stop early at a given token, i.e. use fewer layers for it (ofc with training rewards that favor fewer layers, to represent the compute savings).
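To make the stick-breaking part concrete, here is a minimal sketch in plain Python. It is not taken from any particular paper's code; the function names (`stick_breaking_weights`, `expected_depth`) and the idea of using expected depth as the "fewer layers" penalty are my own illustrative assumptions. Each layer emits a halting probability for the token, and the weight of layer `l` is that probability times whatever part of the "stick" the earlier layers left over, so the weights always sum to 1 and every layer keeps a gradient path.

```python
def stick_breaking_weights(halt_probs):
    """Turn per-layer halting probabilities (each in [0, 1]) into weights summing to 1.

    Hypothetical helper: layer l gets halt_probs[l] times the stick left over
    by layers 0..l-1; the last layer takes whatever remains.
    """
    weights = []
    remaining = 1.0  # portion of the stick not yet broken off
    for p in halt_probs[:-1]:
        weights.append(remaining * p)
        remaining *= 1.0 - p
    weights.append(remaining)  # final layer absorbs the remainder
    return weights


def expected_depth(weights):
    """Expected number of layers used (1-indexed); assumed here as the compute penalty."""
    return sum((i + 1) * w for i, w in enumerate(weights))


# Example: a token that wants to halt with probability 0.5 at every layer.
weights = stick_breaking_weights([0.5, 0.5, 0.5])
print(weights)                  # [0.5, 0.25, 0.25]
print(expected_depth(weights))  # 1.75
```

In a real model the per-token output would be the weight-averaged layer states, and something like `expected_depth` would be added to the loss with a small coefficient so the model is nudged toward exiting early.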