by allisdust
1023 days ago
Occam's razor tells us that if it were a great architectural or technical breakthrough, it would have taken the world by storm by now, as the original transformer paper and model did. Since 2017, the only successful models have been variations of transformers. RNNs are nowhere in the picture. Simply believing an architecture is superior doesn't make it so. Nothing converges and performs as well as a model with attention, in both training and inference. The difference is night and day.
I think you're making the opposite case here without even realizing it yourself. The Transformer was proposed in 2017, and softmax-based attention was proposed in 2014, but it wasn't until GPT-3 (2020) and its successors took the world by storm in the following years that people really started using the Transformer as it is today.
The path from being proposed to being used in production models also took multiple steps, with things improving at each step. It's an iterative process, not "dump -> done".
> Simply believing an architecture is superior doesn't make it so
I agree. But I also agree with the converse: just because you believe it's not superior doesn't make it so. If you claim "RWKV performs poorly in practice", I expect you to have something to back that up, definitely more than "Just because you believe it's superior doesn't make it so".