|
|
|
|
|
by samsartor
230 days ago
|
|
I'm skeptical that we'll see a big breakthrough in the architecture itself. As sick as we all are of transformers, they are really good universal approximators. You can get some marginal gains, but how more _universal_ are you realistically going to get? I could be wrong, and I'm glad there are researchers out there looking at alternatives like graphical models, but for my money we need to look further afeild. Reconsider the auto-regressive task, cross entropy loss, even gradient descent optimization itself. |
|
The softmax has issues regarding attention sinks [1]. The softmax also causes sharpness problems [2]. In general this decision boundary being Euclidean dot products isn't actually optimal for everything, there are many classes of problem where you want polyhedral cones [3]. Positional embedding are also janky af and so is rope tbh, I think Cannon layers are a more promising alternative for horizontal alignment [4].
I still think there is plenty of room to improve these things. But a lot of focus right now is unfortunately being spent on benchmaxxing using flawed benchmarks that can be hacked with memorization. I think a really promising and underappreciated direction is synthetically coming up with ideas and tests that mathematically do not work well and proving that current arhitectures struggle with it. A great example of this is the VITs need glasses paper [5], or belief state transformers with their star task [6]. The Google one about what are the limits of embedding dimensions also is great and shows how the dimension of the QK part is actually important to getting good retrevial [7].
[1] https://arxiv.org/abs/2309.17453
[2] https://arxiv.org/abs/2410.01104
[3] https://arxiv.org/abs/2505.17190
[4] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5240330
[5] https://arxiv.org/abs/2406.04267
[6] https://arxiv.org/abs/2410.23506
[6] https://arxiv.org/abs/2508.21038