Dynamic Transformer-layers allocation per step is intuitively and computationally appealing. Can't believe this idea has been under-explored for years.
This still isn't technically dynamic allocation since it always takes a top-k (constant k) tokens from the sequence, so more like dynamic routing, which was explored in Mixture-of-Expert models but only in Feed-Forward blocks and with a different routing scheme.
One can also make a model to learn the necessary context length for each layer and head to save a huge amount of FLOPs: https://arxiv.org/abs/1905.07799