
by spmurrayzzz 439 days ago

Not sure what you mean by summary paper; it's a pretty dense topic that assumes a fair amount of prior knowledge of the fundamentals. But maybe the Meta blog post will suffice for that?

Otherwise, yes, there are lots of papers on this and related topics, a few dozen in fact. Here are some notable ones; a couple of them are linked in Meta's blog post.

RoFormer: Enhanced Transformer with Rotary Position Embedding - https://arxiv.org/abs/2104.09864

Scaling Laws of RoPE-based Extrapolation - https://arxiv.org/abs/2310.05209

The Impact of Positional Encoding on Length Generalization in Transformers - https://arxiv.org/abs/2305.19466

Scalable-Softmax Is Superior for Attention - https://arxiv.org/abs/2501.19399


theGnuMe 439 days ago

Thanks! I am familiar with attention, linear attention, flash attention, etc., just not up to speed on how they are scaled to 1M or 10M context windows.


spmurrayzzz 439 days ago

Ah, got it. Yeah, then I'd focus on learning how RoPE works first. That will at least help you understand why retrieval in current long-context implementations is so limited.

A colleague from a Discord I spend time in threw together this video a year or so ago; it might be helpful as a first watch before a deep dive: https://www.youtube.com/watch?v=IZYx2YFzVNc

It covers positional encoding as a general concept first, then goes into rotary embeddings.
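
For a concrete feel before watching, here is a minimal sketch of the rotary idea in plain Python, following the formulation in the RoFormer paper linked above. The function names (`rope`, `dot`), the toy vectors, and the choice of `base=10000.0` are illustrative assumptions, not code from any particular model. It rotates each consecutive pair of dimensions by an angle proportional to the token position, which should make the query-key dot product depend only on the relative offset between the two positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply a rotary position embedding to one vector (illustrative sketch).

    Each consecutive pair (vec[2i], vec[2i+1]) is rotated by the angle
    pos * base**(-2i/d), where d is the vector length.
    """
    d = len(vec)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        # Standard 2D rotation of the pair (x, y) by theta.
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (made-up values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at two very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print(abs(score_near - score_far) < 1e-9)
```

The two scores should agree up to floating-point error, since only the offset between query and key positions enters the attention score. The scaling papers listed earlier largely study how those rotation angles behave at positions well beyond what the model saw in training.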


theGnuMe 437 days ago

Thanks! That was super helpful.
