| HN Mirror



	by shpongled 358 days ago
	I looked through their torch implementation and noticed that they are applying RoPE to both query and key matrices in every layer of the transformer - is this standard? I thought positional encodings were usually just added once at the first layer

m_ke 358 days ago

No, they're usually applied at each attention layer.
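
For illustration, here is a minimal sketch of what "applied at each attention layer" means in practice. It is not the implementation discussed above: the names `rope`, `attention_layer`, and `forward` are made up, the projections are identity matrices to keep it short, and it assumes the interleaved pairing of dimensions. The point it tries to show is that the rotation touches only the queries and keys inside attention, never the values or the residual stream, so every layer has to apply it again.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of vec by an angle that grows with pos."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = pos * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def attention_layer(xs):
    """Toy causal single-head self-attention with identity projections (assumption)."""
    dim = len(xs[0])
    # Position enters here, on queries and keys only.
    qs = [rope(x, p) for p, x in enumerate(xs)]
    ks = [rope(x, p) for p, x in enumerate(xs)]
    out = []
    for i, q in enumerate(qs):
        scores = [sum(a * b for a, b in zip(q, ks[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Values are the unrotated inputs, so the output carries no rotation.
        mixed = [sum(w * xs[j][d] for j, w in enumerate(weights)) for d in range(dim)]
        out.append([x + m for x, m in zip(xs[i], mixed)])  # residual connection
    return out

def forward(xs, num_layers=3):
    """Stack of layers: rope() runs inside every one of them."""
    for _ in range(num_layers):
        xs = attention_layer(xs)
    return xs

tokens = [[0.3, -1.2, 0.7, 0.5], [1.0, 0.4, -0.6, 0.9], [-0.2, 0.8, 0.1, -0.5]]
hidden = forward(tokens)
```

Contrast this with an additive scheme, where the position vector is mixed into the embeddings once and then rides along in the residual stream.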

shpongled 358 days ago

Do you know when this was introduced (or which paper)? AFAIK it's not that way in the original transformer paper, or BERT/GPT-2

spott 357 days ago

All the Llamas have done it (well, 2 and 3, and I believe 1, I don't know about 4). I think they have a citation for it, though it might just be the RoPE paper (https://arxiv.org/abs/2104.09864).

I'm not actually aware of any model that doesn't do positional embeddings on a per-layer basis (excepting BERT and the original transformer paper, and I haven't read the GPT-2 paper in a while, so I'm not sure about that one either).

shpongled 357 days ago

Thanks! I'm not super up to date on all the ML stuff :)

Scene_Cast2 357 days ago

Should be in the RoPE paper. The OG transformers used additive sinusoidal embeddings, while RoPE does a pairwise rotation.
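
A rough sketch of the two schemes side by side, in plain Python. The function names and the toy vectors are made up for illustration, and the rotation assumes the interleaved pairing (dimensions 0 and 1, 2 and 3, and so on); some implementations pair the first half of the vector with the second half instead. The original transformer adds a sin/cos vector to the token embedding, while RoPE rotates each pair by a position-dependent angle, which makes the query-key dot product depend only on the relative offset.

```python
import math

def sinusoidal_encoding(pos, dim, base=10000.0):
    """Sin/cos position vector in the style of the original transformer."""
    enc = []
    for i in range(0, dim, 2):
        angle = pos / (base ** (i / dim))
        enc.extend([math.sin(angle), math.cos(angle)])
    return enc

def add_position(x, pos):
    """Additive scheme: sum the encoding into the embedding once, at the input."""
    return [xi + pi for xi, pi in zip(x, sinusoidal_encoding(pos, len(x)))]

def rope_rotate(x, pos, base=10000.0):
    """RoPE scheme: rotate each pair (x[i], x[i+1]) by pos * theta_i."""
    dim = len(x)
    out = []
    for i in range(0, dim, 2):
        angle = pos * base ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at two very different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

Here `score_a` and `score_b` agree up to floating-point error, because rotating `q` by position m and `k` by position n leaves a dot product that depends only on m - n.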

There's also NoPE, I think SmolLM3 "uses NoPE" (aka doesn't use any positional stuff) every fourth layer.
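
The "every fourth layer" pattern can be sketched as a per-layer flag. This is only an illustration: `rope_layer_mask` is a hypothetical helper, and which layer within each group of four skips RoPE is an assumption here (the last one), not something checked against the SmolLM3 code.

```python
def rope_layer_mask(num_layers, nope_every=4):
    """True where a layer applies RoPE, False where it uses no positional encoding."""
    # Assumption: the last layer in each group of `nope_every` is the NoPE layer.
    return [(layer + 1) % nope_every != 0 for layer in range(num_layers)]

mask = rope_layer_mask(8)
# A model would consult mask[layer] before rotating that layer's queries and keys.
```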

Nimitz14 357 days ago

This is normal. RoPE was introduced after BERT/GPT-2.
