| HN Mirror

They do actually make several claims as to the efficiency of the architecture compared to the Transformer, as you can see by the many graphs throughout the document. Their claim that their architecture is the only one that allows for gradually increasing the number of weights is a prominent one too, though, so I'll explain why I don't find that claim credible.

The idea of gradually increasing the size of a Transformer to save on training costs is not a novel one, and researchers have explored ideas to this effect almost since the Transformer's inception. There are many ways to do it. We can start with a small number of layers, and then add in more initialized to the identity. We can keep the number of layers constant and start with a small, then increase the width throughout training, initializing the extra weights to zero. We can reformulate all weight matrices as LoRAs and start with a tiny rank, then slowly increase the rank until we reach a full-rank equivalent. Or we can use two or three of these strategies and mix them any way we want.

The performance of the resultant model is entirely dependent on what strategies you use, and how you mix them: whether you choose to increase width, depth, or rank all at once, one at a time, or somewhere in-between, and whether you increase those values linearly, exponentially, or by some new function you just thought of. Because there are so many ways to gradually increase the size of a Transformer, when you think of a new way, you've got to pick a strong baseline to compare against.

The authors choose the baseline Net2Net (2015). The paper, written two years before the invention of the Transformer, regrettably does not include pre-trained Transformer results for the authors to compare against. So, the authors train their own Net2Net model, and provide a couple nice graphs where the TokenFormer loss curve is under the Net2Net Transformer's for the entirety of training in Figure 6 and Figure 7. They provide no details of the training setup that produced these graphs: the model size, layer count, and width are all missing, as well as basic hyperparameters like the learning rate and batch size. They train on enwik8 (100MB) and seem to repeat data: near the end the TokenFormer reaches sub-0.5 perplexity levels, an impossible result for English text with reasonable entropy that a language model has never seen before.

Why choose this strange, home-grown baseline, reliant on a method developed in 2015, to compare against? Why not at least use a method tuned specifically for the Transformer (such as [1](https://arxiv.org/abs/2203.06211), [2](https://arxiv.org/abs/2401.02415), [3](https://arxiv.org/abs/2309.03852), to name a few!) If their progressive scaling method is truly better, it would only benefit from comparison against a strong baseline.

The authors' progressive scaling method is an idea that has been explored many times by other ML researchers. Their method in particular is compared against a weak baseline with no concrete details other than the loss graphs. In my humble opinion, it's merely an effort to shoehorn a claim of novelty into a paper that isn't.