HN Mirror


by nl 809 days ago

Most important paper of 2024.

The idea that we want models not to have to use the same amount of compute for every token has been around for a while. This is the first compelling mechanism I've seen for doing it.

> Equipped with these new methods, we can sample autoregressively by choosing to route tokens to or around a block based on the router’s output, which does not depend on any information from future tokens. We provide empirical evidence that this is a relatively easy auxiliary task that quickly achieves 99% accuracy.

Does anyone else find this a bit surprising?
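
For anyone who wants the quoted mechanism in concrete terms, here is a rough toy sketch of per-token routing at sampling time. It is not the paper's code: `router_score`, `toy_block`, `route_token`, `decode`, the linear-plus-sigmoid router, the 0.5 threshold, and the score-weighted residual update are all assumptions made for illustration. The only point is that each route decision reads that token's own hidden state, so nothing from future tokens is needed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def router_score(token, router_weights):
    # Hypothetical linear router: dot product with this token's own hidden state.
    return sigmoid(sum(w * x for w, x in zip(router_weights, token)))


def toy_block(token):
    # Stand-in for an attention + MLP block (assumption: a fixed elementwise map).
    return [2.0 * x for x in token]


def route_token(token, router_weights, threshold=0.5):
    """Return (output, used_block) for one token, using only that token's state."""
    score = router_score(token, router_weights)
    if score > threshold:
        # Routed through the block: residual plus score-weighted block output.
        update = toy_block(token)
        return [x + score * u for x, u in zip(token, update)], True
    # Routed around the block: the residual stream passes through unchanged.
    return list(token), False


def decode(tokens, router_weights):
    outputs, decisions = [], []
    for token in tokens:  # one token at a time, as in autoregressive sampling
        out, used = route_token(token, router_weights)
        outputs.append(out)
        decisions.append(used)
    return outputs, decisions


if __name__ == "__main__":
    outs, used = decode([[1.0, 0.0], [-1.0, 0.0]], router_weights=[1.0, 0.0])
    print(used)
```

Because the decision for a token never looks ahead, running `decode` on a prefix gives the same decisions as the matching prefix of a longer run. Skipped tokens cost no block compute in this toy setup.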

2 comments

namibj 809 days ago

Sparse Universal Transformer is older and already did routing-based early termination...


imtringued 809 days ago

Most important? The idea that not every token needs the full context window should be an obvious optimization.


whimsicalism 809 days ago

that’s not the idea here
