|
Yes, and as a result the MLP layer is functionally unchanged. In the vanilla GPT-2 Transformer, the MLP layer is an up-projection to 4x the hidden size, then a non-linearity, followed by a down-projection back to the hidden size. This can be understood as a special case of their method, as they describe here:

> The number of key-value parameter pairs in both the query-key-value and output projections corresponds directly to the hidden dimension. In contrast, the FFN module utilizes four times the number of parameter pairs relative to the hidden size.

Here is the original FFN as described in GPT-2:

    y = GELU(x @ W_u) @ W_d

And here is their FFN, when understood as a special case of their "Attention":

    y = modified_softmax(x @ W_k) @ W_v

You can name the matrices whatever you want, but the grand enhancement that the authors make to the FFN is just replacing the GELU with a different non-linearity.

Shazeer already conducted extensive empirical tests of different non-linearities for the FFN layer in 2020. Among the best was SwiGLU, which is used in Llama today. Unsurprisingly, a modified softmax did not make the cut.

Again, if the changes in this paper were truly a step forward instead of a mindless scrambling of architecture in an effort to achieve something publishable, it would show in the results. Instead, as you can see in their appendix, TokenFormer is on par with or loses to other models in fair comparisons. |
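To make the structural point concrete, here is a minimal pure-Python sketch in which both FFN variants go through the same function and differ only in the row-wise non-linearity passed in. The toy sizes, the weight values, and the names `ffn`, `matmul`, and `softmax_standin` are all hypothetical. In particular, `softmax_standin` is an ordinary softmax used as a placeholder, not the paper's modified softmax.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def gelu(row):
    """Exact (erf-based) GELU applied elementwise to one row."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row]

def softmax_standin(row):
    """Ordinary softmax over one row; a placeholder, NOT the paper's modified softmax."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def ffn(x, w_in, w_out, act):
    """Project up, apply a row-wise non-linearity, project back down."""
    hidden = [act(row) for row in matmul(x, w_in)]
    return matmul(hidden, w_out)

d = 2                                  # toy hidden size (assumption)
x = [[1.0, -1.0]]                      # one token of width d
# Arbitrary deterministic weights: W_u / W_k is d x 4d, W_d / W_v is 4d x d.
w_in = [[0.1 * (i + j) for j in range(4 * d)] for i in range(d)]
w_out = [[0.05 * (i - j) for j in range(d)] for i in range(4 * d)]

y_gelu = ffn(x, w_in, w_out, gelu)                # y = GELU(x @ W_u) @ W_d
y_other = ffn(x, w_in, w_out, softmax_standin)    # y = act(x @ W_k) @ W_v
```

Both calls use the same weights and the same code path, so any difference between `y_gelu` and `y_other` comes from the activation alone.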