|
|
|
|
|
by Smerity
897 days ago
|
|
I think you've done a great explanation expansion except I believe it's ALiBi ("Attention with Linear Biases Enables Input Length Extrapolation"), a method of positional encoding (i.e. telling the Transformer model how much to weight a distant token when computing the current output token). This has been used on various other LLMs[2]. [1]: https://arxiv.org/abs/2108.12409 [2]: n.b. Ofir Press is co-creator of ALiBi https://twitter.com/OfirPress/status/1654538361447522305 |
|
In more intuitive terms, your bog-standard transformer overdoes it in terms of considering all context equally in the final prediction, and we historically used rather blunt-force instruments like causally masking everything to zero.
These techniques are still heuristic and I imagine every serious shop has tweaks and tricks that go with their particular training setup, but the Rope shit in general is kind of a happy medium and exploits locality at a much cheaper place in the overall computation.