Hacker News
by yldedly 1058 days ago
The projection and MLP layers don't compare all pairs of embeddings the way attention does, so they can't distinguish between contexts where delimiters are low-importance and contexts where they are high-importance. The projection layer always mixes the attention heads in the same way, and the same MLP is applied to every input.
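A rough sketch of that difference, in plain Python. Everything here is made up for illustration: the width `D`, the random weights, the toy `delimiter` vector and the two contexts are all assumptions, and the attention is a single head with identity Q/K/V maps rather than anything from a real model. The output projection is left out since, like the MLP, it is a fixed map applied to each position on its own.

```python
import math
import random

random.seed(0)
D = 4  # embedding width, arbitrary for this sketch


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


W1 = rand_matrix(D, D)
W2 = rand_matrix(D, D)


def mlp(x):
    # Same weights at every position; it only ever sees one embedding.
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)


def attention(query, context):
    # Toy single head with identity Q/K/V maps (an assumption).
    # The weights come from comparing the query with every embedding.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(D)
        for key in context
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, context)) for i in range(D)]


# The same hypothetical delimiter embedding placed in two different contexts.
delimiter = [1.0, 0.0, 0.0, 0.0]
context_a = [delimiter, [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
context_b = [delimiter, [3.0, 0.0, 0.0, 1.0], [2.0, 0.0, 0.0, -1.0]]

# MLP output at the delimiter position: ignores the neighbours entirely.
mlp_a = [mlp(x) for x in context_a][0]
mlp_b = [mlp(x) for x in context_b][0]

# Attention output at the delimiter position: depends on the neighbours.
attn_a = attention(delimiter, context_a)
attn_b = attention(delimiter, context_b)

mlp_same = mlp_a == mlp_b
attn_same = attn_a == attn_b
print("MLP output identical across contexts:", mlp_same)
print("Attention output identical across contexts:", attn_same)
```

In a real transformer the MLP input already contains whatever attention mixed in, so its output does vary with context indirectly. The sketch only isolates the point above: the MLP itself never compares one position with another.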