by spmurrayzzz
439 days ago
Not sure what you mean by summary paper; it's a pretty dense topic that assumes a fair amount of prior knowledge of the fundamentals. But maybe the Meta blog post will suffice for that? Otherwise, yes, there are lots of papers on this and related topics, a few dozen in fact. Here are some notable ones; a couple of them are linked in their blog post.

RoFormer: Enhanced Transformer with Rotary Position Embedding - https://arxiv.org/abs/2104.09864
Scaling Laws of RoPE-based Extrapolation - https://arxiv.org/abs/2310.05209
The Impact of Positional Encoding on Length Generalization in Transformers - https://arxiv.org/abs/2305.19466
Scalable-Softmax Is Superior for Attention - https://arxiv.org/abs/2501.19399