| I only spent a few minutes skimming the paper, but: 1) there are a lot of papers claiming to be the successor to the Transformer, and not all of them are cited; e.g., the MetaFormer is missing
https://arxiv.org/abs/2111.11418.
Another candidate that wasn't compared against (or at least discussed in terms of why a comparison wouldn't make sense) is the Hopfield Networks paper
https://arxiv.org/abs/2008.02217. So until a more solid Related Work section is written (their section is actually called "Relation to and Differences from Previous Methods") I reserve the right to be skeptical about whether their model is the "best" successor to the Transformer. 2) they say in the abstract "We theoretically derive the connection between recurrence and attention" but I couldn't find a longer theorem-proof section. So either this is done only in a cursory manner, or the proof is very easy.
Recurrence and attention have been around for a long time as concepts, so surely there are already proofs of this fact in similar contexts (I am not working in this particular area of Machine Learning, so I don't know the SOTA by heart, but I strongly suspect that these aspects have been discussed previously; the Hopfield Network paper I linked to unearths some theoretical facts about attention, for example). So, based on my very cursory reading, this paper seems like an interesting approach, but I do see some holes in the execution. Time will tell whether Retentive Network will become mainstream or not. Ok, this was my five-minute review of the paper. Now I have to urgently return to completing my actual reviews for NeurIPS, haha. |
Neither of those papers is applicable to NLP? And I think it's perfectly fair to focus on the alternatives (such as H3 and RWKV) that have been able to scale up to LLM levels and perplexity, which neither of the alternatives you mention has. Should they just cite every 'is All You Need' paper?