| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by tananan 616 days ago
	Haven't gone through the paper fully, but just looking at the functional form of their attention, it seems more like a constraint on a standard MHA than an architectural discovery. Take a vanilla MHA, tie the V projection between consecutive heads, make the output projection subtract consecutive heads, with some fixed prefactor and voila, you're most if not all of the way there.