by tananan 613 days ago
Now I'm wondering, isn't there usually a `num_heads x value_dim -> model_dim` projection that goes after an MHA? The W in `softmax(QK)VW`? That one can play the role of this subtraction in a vanilla transformer, no? So I wonder what advantage splitting things up like this actually brings.
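To make the question concrete, here is a minimal sketch in plain Python of how the output projection W could reproduce a subtraction between two heads. It rests on assumptions that are not in the comment above: two heads that share one value projection (`wv`), a fixed scalar `lam` weighting the subtracted head, and a hand-built `w_out` rather than a learned one. All names are hypothetical. If the two heads had different value projections, or if a normalization or nonlinearity sat between the subtraction and W, this equivalence would not hold in general.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def attention_map(x, wq, wk):
    """softmax(Q K^T / sqrt(d)) for one head."""
    q, k = matmul(x, wq), matmul(x, wk)
    scale = math.sqrt(len(wq[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    return softmax_rows(scores)


rng = random.Random(0)
seq_len, model_dim, head_dim = 4, 6, 3
lam = 0.5  # hypothetical weight on the subtracted head

x = rand_matrix(seq_len, model_dim, rng)
wq1, wk1 = rand_matrix(model_dim, head_dim, rng), rand_matrix(model_dim, head_dim, rng)
wq2, wk2 = rand_matrix(model_dim, head_dim, rng), rand_matrix(model_dim, head_dim, rng)
wv = rand_matrix(model_dim, head_dim, rng)  # assumption: both heads share V

v = matmul(x, wv)
head1 = matmul(attention_map(x, wq1, wk1), v)
head2 = matmul(attention_map(x, wq2, wk2), v)

# Explicit subtraction: (A1 - lam * A2) V
explicit = [[p - lam * q for p, q in zip(r1, r2)] for r1, r2 in zip(head1, head2)]

# Vanilla MHA path: concatenate the heads, then apply W = [[I], [-lam * I]]
concat = [r1 + r2 for r1, r2 in zip(head1, head2)]
identity = [[1.0 if i == j else 0.0 for j in range(head_dim)] for i in range(head_dim)]
w_out = identity + [[-lam * e for e in row] for row in identity]
via_projection = matmul(concat, w_out)

max_diff = max(
    abs(p - q) for r1, r2 in zip(explicit, via_projection) for p, q in zip(r1, r2)
)
print("max abs difference:", max_diff)
```

The two paths agree up to floating-point error, so under these assumptions W is at least expressive enough to perform the subtraction. Whether a learned W would actually find that solution is a separate question that this sketch does not answer.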