Hacker News new | ask | show | jobs
by nl 809 days ago
Most important paper of 2024.

The idea that we want models not to have to use the same amount of compute for every token has been around for a while. This is the first compelling mechanism I've seen for doing it.

> Equipped with these new methods, we can sample autoregressively by choosing to route tokens to or around a block based on the router’s output, which does not depend on any information from future tokens. We provide empirical evidence that this is a relatively easy auxiliary task that quickly achieves 99% accuracy.

Does anyone else find this is a bit surprising?

2 comments

Sparse Universal Transformer is older and already did routing-based early termination...
Most important? The idea that not every token needs the full context window should be an obvious optimization.
that’s not the idea here