|
Cool. There are also additional issues with RoPEAttention that you might want to fix. The reference paper for rotary embeddings is RoFormer: https://arxiv.org/pdf/2104.09864v4.pdf
First, you shouldn't rotate the values, only the keys and queries. This is wrong: v_out = (torch.bmm(v.transpose(0,1), self.R[:m, ...])).transpose(0,1)
Second, you shouldn't pass the rotated tensors to the multihead attention module, which has additional inner projection weights that will mess with the rotations you have just done. This is wrong: activations, attn_weights = self.multihead(q_out,k_out,v_out) Instead, you should use scaled_dot_product_attention(q_out,k_out,v_out), with v_out left unrotated as per the first point.
Third, each attention head should be treated the same way, and each attention head should use the same rotation frequencies. |
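To make the three points above concrete, here is a minimal sketch, not the repository's actual code. The helper names (rope_angles, apply_rope, rope_attention), the (batch, heads, seq_len, head_dim) layout, the interleaved channel pairing, and the base of 10000 are all assumptions chosen for illustration. It rotates only the queries and keys, shares one set of frequencies across all heads, and calls scaled_dot_product_attention directly so that no learned weights sit between the rotation and the dot product.

```python
import torch
import torch.nn.functional as F


def rope_angles(seq_len, head_dim, base=10000.0):
    # One frequency per channel pair: theta_i = base^(-2i / head_dim).
    # The same table is used for every head (assumed convention).
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    inv_freq = base ** (-exponents)
    positions = torch.arange(seq_len, dtype=torch.float32)
    return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)


def apply_rope(x, angles):
    # x: (batch, heads, seq_len, head_dim), head_dim must be even.
    # Each (even, odd) channel pair is rotated by its position-dependent angle.
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack(
        (x_even * cos - x_odd * sin, x_even * sin + x_odd * cos), dim=-1
    )
    return rotated.flatten(-2)  # back to (batch, heads, seq_len, head_dim)


def rope_attention(q, k, v, causal=True):
    # q, k, v are assumed to be already projected and split into heads.
    angles = rope_angles(q.shape[-2], q.shape[-1])
    q_out = apply_rope(q, angles)  # rotate queries
    k_out = apply_rope(k, angles)  # rotate keys
    # Values are passed through unrotated; no extra weights after the rotation.
    return F.scaled_dot_product_attention(q_out, k_out, v, is_causal=causal)
```

In this layout any learned per-head projections would be applied before apply_rope, so the rotation is the last operation on q and k before the dot product.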
Wait, does that mean that rotary embeddings don't work with multiheaded attention? This is the first I have heard of this. Wouldn't this be an issue with position embeddings as well (for example, sinusoidal position embeddings are a special case of rotary embeddings)?