by VHRanger 806 days ago

Like RWKV and Mamba, this is mixing some RNN properties to avoid the issues transformers have.

However I'm curious about their scaling claims. They have a plot that shows how the model scales in training with the FLOPs you throw at it.

But the issue we should rather be concerned with is the wall time of training for a set amount of hardware.

Back in 2018, we could train medium sized RNNs, the issue was with wall time of training and training stability.
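To make the FLOPs versus wall time distinction concrete, here is a rough back-of-the-envelope sketch. All numbers (FLOP budget, device count, peak throughput, utilization figures) are made-up assumptions for illustration, not measurements of any real model or hardware.

```python
def wall_time_days(total_flops, num_devices, peak_flops_per_device, utilization):
    """Estimated training wall time in days for a fixed hardware budget."""
    effective_flops_per_second = num_devices * peak_flops_per_device * utilization
    return total_flops / effective_flops_per_second / 86_400  # seconds per day


# Hypothetical: the same FLOP budget on the same 8 devices, but one architecture
# keeps the hardware busy (40% utilization) and the other is bottlenecked by
# sequential dependencies (5% utilization).
budget = 1e21
parallel_friendly = wall_time_days(budget, 8, 1e14, 0.40)
sequential_bound = wall_time_days(budget, 8, 1e14, 0.05)

print(f"parallel-friendly: {parallel_friendly:.1f} days")
print(f"sequential-bound:  {sequential_bound:.1f} days")
```

Two models can sit on the same point of a loss-versus-FLOPs plot and still differ by a large factor in how long the run actually takes.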

3 comments

whimsicalism 806 days ago

transformers were also just better at the LM task than 2018 RNNs for an equal amount of training FLOPs

VHRanger 806 days ago

Yeah, that's just the training stability part to my knowledge

whimsicalism 806 days ago

they're also just less capable models. like just adding attention on top of an RNN made them a lot better

SpaceManNabs 806 days ago

Calculating self-attention is still quadratic though. So you get the negatives of transformers there too.
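A minimal sketch of where the quadratic cost comes from, assuming a naive single-head self-attention with no learned projections (q = k = v = x). The function name and the returned score count are illustrative only; real implementations use batched tensor ops and often fused kernels, but the number of pairwise scores is the same.

```python
import math


def self_attention(x):
    """Naive single-head self-attention over a list of n vectors (q = k = v = x)."""
    n, d = len(x), len(x[0])
    # One score per (query, key) pair: this n x n table is the quadratic part.
    scores = [
        [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in range(n)]
        for i in range(n)
    ]
    out = []
    for row in scores:
        m = max(row)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * x[j][c] for j, w in enumerate(weights)) for c in range(d)])
    return out, n * n  # outputs and the number of pairwise scores computed


_, scores_4 = self_attention([[float(i), 1.0] for i in range(4)])
_, scores_8 = self_attention([[float(i), 1.0] for i in range(8)])
print(scores_4, scores_8)  # doubling the sequence length quadruples the scores
```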

foota 806 days ago

Do you know the downside with RWKV? Based on how they present it, it seems like the best thing since sliced bread, but I would have assumed that it would have been widely adopted if that were the case.

kouteiheika 806 days ago

The downside is that it's bad (like, really bad) on a certain subset of tasks. I once trained an RWKVv4 model on a machine translation task and no matter how much I scaled it up it just didn't work at all, while an equivalent transformer did the job without a problem.

Intuitively this does make sense, because a transformer can at any time "look back" at the source sentence and at what it has previously generated (due to its attention mechanism) for every token it outputs, while an RNN like RWKV has to compress this into its internal state which is both lossy and limited in size.
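A rough sketch of that size difference, under simplifying assumptions: the transformer keeps one key and one value vector per token per layer, and the recurrent model keeps one fixed-size state per layer. The function names and dimensions are hypothetical, and RWKV's real state layout is more involved than a single vector.

```python
def kv_cache_floats(tokens_seen, d_model, n_layers):
    """Floats a transformer keeps around so it can attend to every past token."""
    return 2 * tokens_seen * d_model * n_layers  # keys + values


def recurrent_state_floats(d_state, n_layers):
    """Floats an RNN-style model keeps, regardless of how much it has read."""
    return d_state * n_layers


for tokens in (100, 1_000, 10_000):
    print(tokens, kv_cache_floats(tokens, 512, 4), recurrent_state_floats(512, 4))
```

The cache grows linearly with the text, so nothing has to be thrown away; the recurrent state stays the same size, so a long source sentence plus its partial translation has to be squeezed into it.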

I haven't looked at the new versions of RWKV (apparently we're at v6 now), but hopefully it performs better now. In the end I think that a hybrid architecture probably makes the most sense - have some sort of an attention mechanism for the near context, and an RNN-like state for far context, and that would give you the best of both worlds.

inciampati 805 days ago

What about multiple passes over the data? Make it recurse.

kouteiheika 805 days ago

I also tried that - try to get it to iteratively "refine" its translation. I don't remember all of the details at this point, but in general it didn't help much. (Although maybe I just did it suboptimally and there might have been a better way to do it.)

I'm guessing scaling the model up massively would probably make it work in one shot (so that whatever it was translating would fit into its state), but I didn't really have the compute to try that.

eyegor 805 days ago

LSTM?

shawntan 805 days ago

Not sure if this is the type of answer you're looking for, but RWKV is not really recurrent the same way RNNs are recurrent. This quasi-recurrentness allows it and its comrades to use algorithms like parallel SCAN to achieve log N complexity when parallelised. But you pay for that in terms of state-tracking.

There's a cool talk here if you care to know the details: https://www.youtube.com/watch?v=4-VXe1yPDjk
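For readers who haven't seen the parallel scan trick, here is a minimal sketch on a generic linear recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0. This is not RWKV's actual formulation, just the simplest recurrence the idea applies to, and the function names are made up for illustration. The list comprehension inside each round stands in for work that real hardware would do in parallel.

```python
def combine(left, right):
    """Compose two steps of h -> a*h + b into one step (associative)."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential(a, b):
    """Reference: the usual N-step loop."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def scan(a, b):
    """Hillis-Steele inclusive scan: about log2(N) rounds instead of N steps."""
    elems = list(zip(a, b))
    n, step, rounds = len(elems), 1, 0
    while step < n:
        # Every position in a round is independent of the others in that round.
        elems = [
            elems[i] if i < step else combine(elems[i - step], elems[i])
            for i in range(n)
        ]
        step *= 2
        rounds += 1
    return [b_t for _, b_t in elems], rounds


a = [0.5, 0.9, 0.1, 0.7, 0.3, 0.8, 0.2, 0.6]
b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
hs, rounds = scan(a, b)
print(rounds, hs[-1], sequential(a, b)[-1])
```

The catch is that `combine` only works because the update is linear in h. A classic RNN update like tanh(W h + x) can't be composed this way, which is one way to see the state-tracking trade-off.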

VHRanger 806 days ago

It seems only OK as a model? Looking at the LLM chat leaderboard it's 71st and the 14B version is worse than a lot of 7B models:

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

Also, llama.cpp makes inference accessible for a lot of people, and it's not available for RWKV.

Not to knock on the model, I'm sure it's good. I also like that it's a successful example of citizen science.

It's just not popular enough to have the inference infrastructure transformers have, not established enough to attract enough money to get 60B+ models trained, and so on.

WanderPanda 806 days ago

This leaderboard is not the best for comparing model architectures; the dataset and finetuning have too much influence. I think perplexity on a particular dataset would be a better way to compare.
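For reference, perplexity is just the exponential of the average negative log-likelihood per token. A minimal sketch, where the per-token probabilities are invented for illustration and don't come from any real model:

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), using natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities two models assign to the same three tokens.
model_a = [math.log(0.5), math.log(0.25), math.log(0.5)]
model_b = [math.log(0.25), math.log(0.25), math.log(0.25)]

print(perplexity(model_a), perplexity(model_b))  # lower is better
```

One caveat: the numbers are only directly comparable if both models use the same tokenizer, since perplexity is measured per token.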

logicchains 805 days ago

>Also, llama.cpp makes inference accessible for a lot of people, and it's not available for RWKV.

It absolutely is: https://github.com/RWKV/rwkv.cpp .

whimsicalism 806 days ago

i believe it is undertrained, at minimum

jimmyl02 806 days ago

From what I know about RWKV, it's mostly a one man effort and doesn't have the same data pipeline / resources as most major labs. It's a bit unfortunate but I'm curious about the performance given the same training corpus as OpenAI's GPTs. Maybe some labs have tried internally but haven't released results? On the other hand it makes sense to invest more money into transformer training runs as they have been proven to work.

They really burst onto the scene and brought back RNNs in the world of transformers. The claim that RWKV isn't parallelizable during training also seems to be refuted in their readme. I'd guess it's its generalizable performance, as there is a difference between doing well on benchmarks and being usable. Personally, I tried running the weights a long time ago when they were first released and the results weren't usable, but I'm sure there has been considerable progress since then.

kouteiheika 806 days ago

> The claim that RWKV isn't parallelizable during training also seems to be refuted in their readme.

RNNs are trivially parallelizable (I've done it myself), as long as you're training them on multiple documents in parallel and have enough memory for the state for each document. You just train them 1 token at a time across N documents, instead of the transformer-like N tokens at a time across 1 document.
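A toy sketch of that scheme, forward pass only. The scalar "RNN" cell and its weights are made up for illustration; actual training would add a loss and backpropagation through time, and the per-document loop inside `step_batch` would be a single batched tensor op on a GPU.

```python
import math


def rnn_step(state, token, w_state=0.5, w_in=0.1):
    """One step of a toy scalar RNN cell (hypothetical weights)."""
    return math.tanh(w_state * state + w_in * token)


def step_batch(states, tokens):
    """Advance every document by one token; the documents don't interact."""
    return [rnn_step(s, t) for s, t in zip(states, tokens)]


def final_state(doc):
    """Reference: process a single document start to finish."""
    state = 0.0
    for token in doc:
        state = rnn_step(state, token)
    return state


docs = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]  # N = 3 equal-length documents
states = [0.0] * len(docs)  # one state per document
for t in range(len(docs[0])):
    states = step_batch(states, [doc[t] for doc in docs])

print(states)
```

Time stays sequential, but the batch dimension gives the hardware N independent updates per step, at the cost of holding N states in memory.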

ianbutler 806 days ago

RWKV is parallel at the level of the sequence, like a transformer. Its formulation allows each timestep t to be calculated in parallel, except for a single serial scan at the end for aggregation, which they use a custom CUDA kernel to do.

kouteiheika 805 days ago

I know. I trained RWKV myself using both methods, like a transformer and like an RNN.

Ultimately it probably doesn't matter that you can train it like a transformer because you can just train it in parallel on multiple documents simultaneously one token at a time, and, at least from my experience, this worked just as well, if not better.

Plus, doing it this way is more general because you don't need any custom kernels to do it, and it also helps the model to learn to deal with an "infinite" context better (while if you train it like a transformer its performance will regress once you evaluate it outside of the context window on which you've trained it, at least from what I've seen in my training runs).

nmfisher 805 days ago

I played around with RWKV some time ago (maybe early 2023?) with similarly disappointing results, but my suspicion was that this was a dataset/training issue, not an architectural one. Leaderboard performance has improved a lot since then, and anecdotally, I've seen/heard some quite decent RWKV TTS experiments, so I'm bullish.

Also, the team has incorporated/raised money from investors (recursal.ai), so it's no longer a one man effort.

GaggiX 806 days ago

The paper shows that the speed is comparable to transformer models, faster with smaller with "long" sequence length like 8k.
