by chessgecko 611 days ago
My hypothesis for why this works is that it mitigates the downsides of RoPE.

to eli5:

RoPE is the modern strategy for telling the model how far apart a query and a key are when doing attention. It's the best strategy we have right now, but it has a major downside: it makes some connections between tokens that are far apart much stronger than you would like them to be. XPos (https://arxiv.org/pdf/2212.10554) is another paper by Microsoft tackling issues with RoPE, and you can see figure 1 on page 4 to get a visual interpretation of the sinusoidal attention strength (you would like it to be smooth).
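To make the "not smooth" part concrete, here is a minimal sketch of the RoPE score as a function of distance. Everything in it is a toy assumption on my part: the 4-dim vectors, the identical query and key, the usual base of 10000, and the helper names `rope_rotate` and `rope_score`. It is not any library's API, just the rotation written out by hand.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each consecutive pair of vec by an angle proportional to pos."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Per-pair frequency: fast for the early pairs, slow for the late ones.
        theta = base ** (-i / dim)
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def rope_score(q, k, q_pos, k_pos):
    """Unscaled dot product of the rotated query and key (pre-softmax logit)."""
    rq = rope_rotate(q, q_pos)
    rk = rope_rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))

# Toy setup (assumption): identical query and key, so only distance changes the score.
q = k = [1.0, 1.0, 1.0, 1.0]
for distance in range(0, 13):
    print(distance, round(rope_score(q, k, distance, 0), 3))
```

The score only depends on the relative distance, which is the point of RoPE, but it doesn't fall off smoothly. In this toy case the score at distance 6 comes out higher than at distance 3, which is the kind of bump I mean.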

I think a big reason differential transformers work so well, especially on long-sequence stuff, is that when neither q1 nor q2 matches a token, the RoPE relative strength still has the same value in both, so the noise cancels out. That leaves the intended matches, but at the cost of somewhat dampening the original value RoPE brought.
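Here is a toy version of the cancellation I have in mind. The logits are made up for illustration: key 1 is the intended match (only the first map sees it), and key 4 is a far-away key that gets the same spurious positional bump in both maps. The function name `differential_weights` and the fixed `lam=1.0` are my assumptions; as I understand it, the paper learns that scalar.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def differential_weights(scores1, scores2, lam=1.0):
    """Difference of two softmax attention maps: softmax(s1) - lam * softmax(s2)."""
    a1 = softmax(scores1)
    a2 = softmax(scores2)
    return [x - lam * y for x, y in zip(a1, a2)]

# Hypothetical logits over 5 keys.
# Key 1: intended match, present only in the first map.
# Key 4: shared positional bump, identical in both maps.
scores1 = [0.0, 1.0, 0.0, 0.0, 2.0]
scores2 = [0.0, 0.0, 0.0, 0.0, 2.0]

plain = softmax(scores1)
diff = differential_weights(scores1, scores2)

print("plain:", [round(w, 3) for w in plain])
print("diff: ", [round(w, 3) for w in diff])
```

With plain softmax the spurious key 4 gets the largest weight. After the subtraction the largest weight is on key 1. The shared bump doesn't cancel exactly, because the two softmaxes normalize differently, and the match itself ends up smaller than before, which is the dampening I mentioned.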

Just a hypothesis though. It would be easy to test by running this experiment against a baseline where both use ALiBi attention (https://arxiv.org/pdf/2108.12409), which has a different set of tradeoffs that this wouldn't mitigate. Still a really interesting result.