by serverholic
1179 days ago
I'm skeptical that RNNs alone will outperform transformers. Perhaps some sort of transformer + RNN combo? The issue with RNNs is that feedback signals decay over time, so the model will be biased towards more recent words. Transformers, on the other hand, don't have this bias: a word 10,000 words ago can be just as important as a word 5 words ago. The tradeoff is that the context window for transformers is a hard cutoff point.
|
How it works: RWKV gathers information into a number of channels, each of which decays at its own speed as you move to the next token. It's very simple once you understand it.
RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in a usual RNN you can adjust the time-decay of a channel from, say, 0.8 to 0.5 depending on the input (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
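
To make the parallelizable part concrete, here is a rough NumPy sketch. It is not the actual RWKV formula (the real one also weights and normalizes the inputs); it only assumes a toy update `state = w * state + x[t]` with one fixed decay `w` per channel. The function names `decay_recurrent` and `decay_parallel` are made up for this illustration. Because `w` never depends on the input, the state at step t is just a sum of past inputs weighted by `w ** (t - i)`, so every position can be computed at once instead of step by step.

```python
import numpy as np

def decay_recurrent(x, w):
    """RNN-style scan. x: (T, C) float inputs, w: (C,) per-channel decay in (0, 1)."""
    state = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        # Each channel forgets at its own fixed rate, then absorbs the new input.
        state = w * state + x[t]
        out[t] = state
    return out

def decay_parallel(x, w):
    """Same result with no loop over time: out[t] = sum over i <= t of w**(t - i) * x[i]."""
    steps = np.arange(x.shape[0])
    lag = steps[:, None] - steps[None, :]            # lag[t, i] = t - i
    causal = lag >= 0                                # only past and current tokens count
    powers = w[None, None, :] ** np.maximum(lag, 0)[:, :, None]
    weights = np.where(causal[:, :, None], powers, 0.0)   # shape (T, T, C)
    return np.einsum("tic,ic->tc", weights, x)

# Toy check with one W-0.8 channel and one W-0.5 channel.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 2))
w = np.array([0.8, 0.5])
assert np.allclose(decay_recurrent(x, w), decay_parallel(x, w))
```

If `w` were a gate computed from the input at every step, the weights would no longer be plain powers of a constant, and this particular shortcut would not apply. The (T, T, C) weight tensor here is only for clarity; it is quadratic in sequence length and not how you would implement it efficiently.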