Hacker News
by capableweb 1023 days ago
> Occam's razor tells us that if it's a great architecture/technical breakthrough, it would have taken the world by storm by now. Similar to the original transformer paper and model.

I think you're making the opposite case here without even realizing it yourself. The Transformer was proposed in 2017, with the softmax-based attention being proposed in 2014 but it wasn't until 2023 when GPT-3 et al took the world by storm that people started really using the Transformer as it is today.

The path from proposal to use in production models also took multiple steps, with things improving at each step. It's an iterative process, not "dump -> done".

> Simply believing an architecture is superior doesn't make it so

I agree. But the opposite also holds: just because you believe it's not superior doesn't make it so. If you claim "RWKV performs poorly in practice", I expect you to have something to back that up, definitely more than "Just because you believe it's superior doesn't make it so".

4 comments

I've now presented on this model and worked with it. It's not phenomenally better than other models, but it has some attractive scaling and speed properties that may or may not be worth the trade-off relative to its drawbacks, which I detail in my other comment in the GP's thread.
This is really a strange take. Encoder-only transformers like BERT and RoBERTa have been wildly popular in NLP for years now, replacing pretty much every model that came before them and topping pretty much all traditional NLP benchmarks (tagging, parsing, etc.).
Also, BERT was integrated into Google Search in 2019 (see https://blog.google/products/search/search-language-understa...).
> The Transformer was proposed in 2017, with the softmax-based attention being proposed in 2014 but it wasn't until 2023 when GPT-3 et al took the world by storm that people started really using the Transformer as it is today.

The last part is COMPLETELY false. Laughably so.

Transformers have been industrially ubiquitous for a lot longer than GPT-3, initially in machine translation.
"a lot longer" is quite the statement!

What I'm reading here says "the behavior and qualities of these large models is poorly understood"

prove me wrong?

PS: I agree that BERT-related models have been "wildly popular in NLP for years now".