Hacker News new | ask | show | jobs
by shawntan 536 days ago
Although marketed as such, RWKV isn't really an RNN.

In the recent RWKV7 incarnation, you could argue it's a type of Linear RNN, but past versions had an issue of taking its previous state from a lower layer, allowing for parallelism, but makes it closer to a convolution than a recurrent computation.

As for 1), I'd like to believe so, but it's hard to get people away from the addictive drug that is the easily parallelised transformer, 2) (actual) RNNs and attention mechanisms to me seem fairly powerful (expressivity wise) and perhaps most acceptable by the community.

1 comments

Recent work by Feng et al from Bengio's lab focus on how attention can be formulated as an RNN ("Attention as RNN": https://arxiv.org/pdf/2405.13956) and how minimal versions of GRUs and LSTMs can be trained in parallel by removing some parameters ("Were RNNs All We Needed?" https://arxiv.org/pdf/2410.01201).

It's possible we start seeing more blended version of RNN/attention architecture exploring different LLM properties.

In particular, Aaren architecture in the former paper "can not only (i) be trained in parallel (like Transformers) but also (ii) be updated efficiently with new tokens, requiring only constant memory for inferences (like traditional RNNs)."

The formulations in attention as rnn have similar issues as rwkv. Fundamentally it's a question of what we call an RNN.

Personally I think it's important not to call some of these recent architectures RNNs because they have theoretical properties that do not match (read: they're worse) what we've "classically" called RNNs.

Ref: https://arxiv.org/abs/2404.08819

As a rule of thumb: you generally don't get parallelism for free, you pay for it with poorer expressivity.