| HN Mirror

	by namibj 860 days ago
	Cf. also the Universal Transformer: the same layer applied many times in a stack. The sparse version of that is basically MoE plus a stick-breaking halting mechanism, which prevents vanishing gradients while letting the model decide, per token, whether to stop early at a lower layer count (of course with a training incentive that favors fewer layers, to reflect the compute savings).
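A rough sketch of what stick-breaking halting over a repeated layer could look like for a single token. Everything here is an assumption for illustration: the names `stick_breaking_weights`, `shared_layer` and `run_token` are hypothetical, the toy `tanh` layer stands in for the shared (MoE) layer, and the fixed hazards stand in for what would normally be a learned, per-token halting probability. It is not taken from any particular implementation.

```python
import math


def stick_breaking_weights(hazards):
    """Turn per-layer halting hazards into a distribution over halting depth.

    hazards[l] is the assumed probability of halting at layer l, given that
    the token has not halted before layer l. Must be non-empty.
    """
    if not hazards:
        raise ValueError("need at least one layer")
    weights = []
    remaining = 1.0  # stick left = probability the token is still running
    for alpha in hazards[:-1]:
        weights.append(remaining * alpha)  # break off a piece of the stick
        remaining *= 1.0 - alpha
    weights.append(remaining)  # the last layer takes whatever is left
    return weights


def shared_layer(h):
    # Toy stand-in for the one layer that is reused at every depth.
    return [math.tanh(0.5 * x + 0.1) for x in h]


def run_token(h, hazards):
    """Apply the shared layer repeatedly and mix the per-depth states."""
    weights = stick_breaking_weights(hazards)
    output = [0.0] * len(h)
    for w in weights:
        h = shared_layer(h)
        # Each depth contributes directly to the output with its halting weight.
        output = [o + w * x for o, x in zip(output, h)]
    # Expected number of layers used; can be added to the loss as a penalty.
    expected_depth = sum((i + 1) * w for i, w in enumerate(weights))
    return output, expected_depth, weights
```

With hazards `[0.5, 0.5, 0.5]` the halting weights come out as 0.5, 0.25, 0.25 and the expected depth as 1.75. Adding some multiple of that expected depth to the training loss is one way to express the preference for fewer layers; at inference a token could simply stop once its remaining stick is negligible.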