


	by thesunkid 1185 days ago
	Allocating Transformer layers dynamically per step is intuitively and computationally appealing. I can't believe this idea has been under-explored for years.

2 comments

f_devd 1185 days ago

Technically this still isn't dynamic allocation, since it always takes the top-k tokens from the sequence with a constant k. It is closer to dynamic routing, which was explored in Mixture-of-Experts models, but only in the feed-forward blocks and with a different routing scheme.
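
To make the constant-k point concrete, here is a toy sketch of that kind of routing. It is not taken from any paper's code: the names `route_top_k` and `apply_layer` are hypothetical, scalars stand in for token vectors, and the router scores are assumed to be given.

```python
def route_top_k(scores, k):
    """Return indices of the k highest-scoring tokens, in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def apply_layer(tokens, scores, k, block):
    """Run `block` only on routed tokens; all others skip the layer unchanged."""
    chosen = set(route_top_k(scores, k))
    return [block(t) if i in chosen else t for i, t in enumerate(tokens)]


# Hypothetical inputs: four "tokens" and one router score per token.
tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.3, 0.8]

# k is fixed, so exactly two tokens go through the block at every step.
out = apply_layer(tokens, scores, k=2, block=lambda x: x * 10)
print(out)
```

Because k never changes, the compute per layer is the same for every input. Only the choice of which tokens receive it varies, which is why "routing" describes it better than "allocation".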


voxgen 1183 days ago

One can also have the model learn the necessary context length for each layer and head, which saves a huge number of FLOPs: https://arxiv.org/abs/1905.07799
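
A rough sketch of how a learnable context length can work, as I understand the approach in the linked paper: each head scales its attention weights by a soft mask that is 1 inside a learned span and ramps linearly down to 0 just beyond it. Treat the exact form as an assumption. The names `span_mask`, `z`, and `ramp` are mine, and in a real model `z` would be a trained parameter per head rather than a constant.

```python
def span_mask(distance, z, ramp):
    """Soft mask: 1 within span z, then a linear ramp down to 0 over `ramp` positions."""
    return min(max((ramp + z - distance) / ramp, 0.0), 1.0)


def positions_attended(context_len, z, ramp):
    """Count past positions whose mask is non-zero, i.e. that still cost compute."""
    return sum(1 for d in range(context_len) if span_mask(d, z, ramp) > 0.0)


# Hypothetical head with a learned span of 4 and a ramp of 2 positions.
print([span_mask(d, z=4, ramp=2) for d in range(8)])
print(positions_attended(context_len=1024, z=4, ramp=2))
```

Positions where the mask is zero can be dropped from the attention computation entirely, so a head with a short learned span only pays for a small window. That is where the FLOP savings would come from.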
