| As someone who has worked in this space, this paper is unfortunately total BS. Their claimed theoretical advancement is as follows. If you want to transform an input vector X to another vector Y of different dimension, "normal" people suggest to use a linear projection: create an appropriately sized matrix W and simply multiply it by your input: Given X ∈ d_in
and W ∈ d_in × d_out,
then Y ∈ d_out = X @ W. In the attention layer, where the input X is converted into queries Q, keys K, and values V, this is the simple strategy employed: Q = X @ W_q, K = X @ W_k, V = X @ W_v, and it has shown itself to be effective. This is too simple for the authors of this paper. They propose another approach. Instead of converting directly to the desired dimension, we will increase computation by creating an intermediate dimension, and introducing a non-linearity between them. Given X ∈ d_in,
and W_1 ∈ d_in × d_tmp,
and W_2 ∈ d_tmp × d_out,
then Y ∈ d_out = f(X @ W_1) @ W_2. Here, f can be any non-linearity. The authors choose softmax; it allows them to claim a superficial resemblance to attention. Later in the paper, they reveal it is not actually softmax, but a modified version to avoid gradient vanishing (softmax is not a very good general-purpose non-linearity). So, they replace all projections in the attention layer with this new strategy. So Q = f(X @ W_q1) @ W_q2. And K = f(X @ W_k1) @ W_k2. And V = f(X @ W_k3). The problem with this is not theoretical: this does increase the model's expressiveness and computational power. It is practical: we are adding parameters where we need them the least, in the attention layer. It is generally understood that LLMs do not need extra parameters in the attention layer. Actually, advancements like Grouped-Query Attention hinge on the idea that you can halve or even fourth the number of parameters in the attention layer without harming performance. The experience of the LLM community so far suggests that the authors' idea of adding even more parameters to the self-attention layer should degrade their models' performance while adding no tangible gain. The authors' numbers say otherwise. But it is hard to trust their numbers. When training a Transformer to compare against they replicate the original GPT-2 proposed in 2019. In doing so they ignore years of architectural improvements, such as rotary positional embeddings, SwiGLU, and RMSNorm that have culminated in Transformer++, the strong recipe which is what Meta's Llama series uses. We've seen this time after time in the various "Transformer killers" that used to be popular about a year ago. A researcher would think up some novel variant of linear attention, furiously test it against a weak GPT-2 baseline, find it blew it out of the water, and declare victory. Somehow, these never caught on, because when tested against a newer baseline these models weren't actually that great. The authors are doing the same thing here. In their tables they also include comparisons to other models. Actually, they exclusively select the EleutherAI suites: GPT-Neo, OPT, and Pythia. These models were not trained with any modern architectural improvements except rotary embedding (which EleutherAI invented), and so predictably TokenFormer crushes them. On the last page of the appendix the authors have included a full table with some more fair comparisons. Their TokenFormer-150M variant achieves a Pile ppl of 10.45 against Mamba-130M's 10.54. In the intermediate weight class, TokenFormer-450M matches Mamba-370M's 8.28 Pile ppl despite having 21% more parameters. And in the largest size, TokenFormer-1.5B loses to Mamba-1.4B, 6.91 to 6.80 ppl. Overall, the architectural tweak proposed in this paper is impractical, and the few fair comparisons they include are unimpressive. TokenFormer is another in a long line of Transformer-killers that have nice graphs of cherry-picked data, and will similarly fade into obscurity. |