by yldedly
1058 days ago
The projection and MLP layers don't compare all pairs of embeddings the way attention does, so they can't distinguish contexts where delimiters are low-importance from contexts where they are high-importance. The projection layer always mixes the attention heads in the same way, and the same MLP is applied to every position independently.
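
To make the point concrete, here is a rough sketch in plain Python (not from any real model). The weights, the `delim` embedding, and the two contexts are all made up for illustration, and the attention uses identity Q/K/V projections to keep it short. The same delimiter embedding is fed through a position-wise MLP and through single-head self-attention in two different contexts: the MLP output can't change, while the attention output does.

```python
import math

def mlp(x):
    # Position-wise MLP with fixed (hypothetical) weights.
    # It only ever sees one embedding at a time.
    w1 = [[1.0, -1.0], [0.5, 2.0]]
    w2 = [[1.0, 0.0], [-1.0, 1.0]]
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]  # ReLU
    return [sum(w * hi for w, hi in zip(row, h)) for row in w2]

def attention(seq):
    # Single-head self-attention, identity Q/K/V projections (an assumption
    # made for brevity). Every query is compared against every key.
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out

delim = [1.0, 0.0]  # hypothetical delimiter embedding
ctx_a = [delim, [0.0, 1.0], [0.0, 2.0]]
ctx_b = [delim, [3.0, 0.0], [2.0, 0.0]]

# The MLP sees only the delimiter embedding, so the context cannot matter.
mlp_same = mlp(ctx_a[0]) == mlp(ctx_b[0])

# Attention mixes in the other tokens, so the delimiter's output differs.
attn_a = attention(ctx_a)[0]
attn_b = attention(ctx_b)[0]
attn_same = attn_a == attn_b

print("MLP output identical across contexts:", mlp_same)
print("Attention output identical across contexts:", attn_same)
```

The output projection behaves like the MLP in this respect: it is one fixed matrix applied to each position's concatenated heads, so it has no way to reweight heads based on what else is in the context.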