	by yldedly 1103 days ago
	The projection and MLP layers don't compare all pairs of embeddings the way attention does, so they can't distinguish between contexts where delimiters are of low versus high importance. The projection layer always mixes the attention heads in the same way, and the same MLP is applied to every input position.
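
	To make the point concrete, here is a minimal pure-Python sketch, not taken from any particular model. The weights `W1` and `W2`, the toy 2-d embeddings, and the use of identity query/key/value maps in `self_attention` are all assumptions chosen for brevity. It feeds the same "delimiter" vector through a position-wise MLP and through single-head dot-product attention in two different contexts.

```python
import math

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(x, W1, W2):
    # Two-layer MLP with ReLU; it only ever sees one token vector.
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def positionwise(seq, W1, W2):
    # The same MLP (same weights) is applied to every position independently.
    return [mlp(x, W1, W2) for x in seq]

def self_attention(seq):
    # Single-head scaled dot-product attention.
    # Assumption: identity Q/K/V projections, to keep the sketch short.
    d = len(seq[0])
    out = []
    for q in seq:
        # Compare this token against every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, seq)) for i in range(d)])
    return out

# Hypothetical fixed weights and toy embeddings.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.5]]
W2 = [[1.0, 0.0, -1.0], [0.5, 1.0, 0.0]]

delimiter = [1.0, 0.0]
context_a = [delimiter, [0.0, 1.0], [0.0, 2.0]]
context_b = [delimiter, [3.0, 0.0], [2.0, 0.0]]

# Output at the delimiter position (index 0) in each context.
mlp_a = positionwise(context_a, W1, W2)[0]
mlp_b = positionwise(context_b, W1, W2)[0]
attn_a = self_attention(context_a)[0]
attn_b = self_attention(context_b)[0]

print("MLP output identical across contexts:", mlp_a == mlp_b)
print("Attention output identical across contexts:", attn_a == attn_b)
```

	The MLP output for the delimiter is the same in both contexts because the MLP never sees the other tokens, while the attention output changes with the surrounding tokens. The output projection after multi-head attention behaves like the MLP in this respect: it is a fixed linear map applied separately at each position.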