by tananan
613 days ago
Now I'm wondering: isn't there usually a `num_heads x value_dim -> model_dim` projection right after an MHA, the W in `softmax(QK)VW`? Couldn't that one play the role of this subtraction in a vanilla transformer?
So I wonder what advantage splitting things up like this actually brings.
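To make the question concrete, here's a rough NumPy sketch of what I mean. It assumes the subtraction has the form `(A1 - lam * A2) @ V`, that both heads share the same `V`, and that nothing nonlinear sits between the heads and the projection. The function name `demo`, the scalar `lam`, and all the shapes are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def demo(seq_len=5, head_dim=4, value_dim=3, lam=0.5, seed=0):
    rng = np.random.default_rng(seed)
    q1, k1, q2, k2 = (rng.normal(size=(seq_len, head_dim)) for _ in range(4))
    # Assumption: both heads attend over the same value matrix.
    v = rng.normal(size=(seq_len, value_dim))

    a1 = softmax(q1 @ k1.T / np.sqrt(head_dim))
    a2 = softmax(q2 @ k2.T / np.sqrt(head_dim))

    # Explicit subtraction of the two attention maps.
    explicit = (a1 - lam * a2) @ v

    # Vanilla MHA: concatenate head outputs, then apply an output projection W.
    heads = np.concatenate([a1 @ v, a2 @ v], axis=-1)  # (seq_len, 2 * value_dim)
    eye = np.eye(value_dim)
    w = np.concatenate([eye, -lam * eye], axis=0)      # (2 * value_dim, value_dim)
    via_projection = heads @ w

    return explicit, via_projection

explicit, via_projection = demo()
print(np.allclose(explicit, via_projection))
```

The two agree by linearity, but only under those assumptions: in a vanilla transformer each head normally has its own value projection, so the two heads would have to learn the same one, and `W` would have to land on this particular block structure.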