by NLPaep 1023 days ago
RWKV performs poorly in practice.

Would you like to expand on that a little, perhaps with how you know this? Or are you expecting rebuttals that amount to "No, it doesn't"?
Occam's razor tells us that if it's a great architecture/technical breakthrough, it would have taken the world by storm by now. Similar to the original transformer paper and model. Since 2017, the only successful models are variations of transformers. RNNs are nowhere in the picture.

Simply believing an architecture is superior doesn't make it so. Nothing converges and performs as well as a model with attention in both training and inference. The difference is night and day.

> Occam's razor tells us that if it's a great architecture/technical breakthrough, it would have taken the world by storm by now. Similar to the original transformer paper and model.

I think you're making the opposite case here without even realizing it yourself. The Transformer was proposed in 2017, with the softmax-based attention being proposed in 2014 but it wasn't until 2023 when GPT-3 et al took the world by storm that people started really using the Transformer as it is today.

Going from being proposed to being used in production models also took multiple steps, with things improving at each step. It's an iterative process, not "dump -> done".

> Simply believing an architecture is superior doesn't make it so

I agree. But I also agree with the converse: just because you believe it's not superior doesn't make it so. If you claim "RWKV performs poorly in practice", I expect you to have something to back that up, definitely more than "Just because you believe it's superior doesn't make it so".

I've now presented on this model and worked with it. It's not phenomenally better than other models, but it has some attractive scaling and speed properties that may or may not be worth the trade-off relative to its drawbacks, which I detail in my other comment in the GP's thread.
This is really a strange take. Encoder-only transformers like BERT and RoBERTa have been wildly popular in NLP for years now, replacing pretty much every model that came before them and beating pretty much all traditional NLP benchmarks (tagging, parsing, etc.).
Also, BERT was integrated into Google Search in 2019 (see https://blog.google/products/search/search-language-understa...)
> The Transformer was proposed in 2017, with the softmax-based attention being proposed in 2014 but it wasn't until 2023 when GPT-3 et al took the world by storm that people started really using the Transformer as it is today.

The last part is COMPLETELY false. Laughably so.

Transformers have been industrially ubiquitous for a lot longer than GPT-3, initially in machine translation.
"A lot longer" is quite the statement!

The reading here says "the behavior and qualities of these large models is poorly understood"

prove me wrong?

PS: I agree that BERT-related models have been "wildly popular in NLP for years now"

Ideas and trends move much more slowly than we think, especially outside of Silicon Valley or the bubbles we as individuals are currently in.

We as humans tend not to want things to change or to accept new ideas that challenge our existing ones, and we shape our memories accordingly.

Transformer example: The world didn't switch over immediately to transformers in 2017; in fact, the original model had issues converging past 100M params that needed to be sorted out. It arguably picked up steam mostly after BERT a year later.

Day-to-day example: The average person outside the tech bubble still has not tried ChatGPT, and unfortunately it has not taken the world by storm yet.

So while it is true that traditional RNNs with LSTM do not converge as well, the changes presented here are substantial (we removed LSTM, for example).

And it's not a question of belief: the RWKV code is fully open source, public, and available. All claims can be tested, and the results can be replicated by anyone who is willing to put in the time to do so.

Benchmark results in publications show it being confused in chat-like settings and answering questions incorrectly.
Do let me know the use case where it performs poorly.

And I would look into getting it better supported in upcoming datasets we train with =)

(I mean this genuinely; we have an entire Discord channel, #failed-task, to help us keep track and push for the model to be better.)