| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by tananan 660 days ago
	Now I'm wondering, isn't there usually a `num_heads x value_dim -> model_dim` projection that goes after a MHA? The W in `softmax(QK)VW`? That one can play the role of this subtraction in a vanilla transformer, no? So I wonder what kind of advantage does splitting things up like this bring.