HN Mirror


by serverholic 1226 days ago

I'm skeptical that RNNs alone will outperform transformers. Perhaps some sort of transformer + RNN combo?

The issue with RNNs is that feedback signals decay over time, so the model will be biased towards more recent words.

Transformers on the other hand don't have this bias. A word 10,000 words ago could be just as important as a word 5 words ago. The tradeoff is that the context window for transformers is a hard cutoff point.

3 comments

pizza 1226 days ago

I think RWKV ameliorates this to some degree:

How it works: RWKV gathers information to a number of channels, which are also decaying with different speeds as you move to the next token. It's very simple once you understand it.

RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
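To make the parallelization point concrete, here is a toy sketch, not RWKV's actual formulation. It assumes each channel simply accumulates inputs with a fixed decay `w`; the function names and the input sequence are made up for illustration. Because `w` does not depend on the data, the state at every position has a closed form and can be computed independently of the others, and it matches the step-by-step recurrence.

```python
# Toy illustration only: one scalar channel with a fixed, data-independent decay w.
# The names recurrent_state / parallel_state are hypothetical, not from RWKV.

def recurrent_state(xs, w):
    """Sequential form: s_t = w * s_{t-1} + x_t."""
    s = 0.0
    out = []
    for x in xs:
        s = w * s + x
        out.append(s)
    return out


def parallel_state(xs, w):
    """Closed form: s_t = sum over i <= t of w**(t - i) * x_i.

    Each position t is computed without reference to s_{t-1},
    so all positions could be evaluated at the same time.
    """
    return [
        sum(w ** (t - i) * xs[i] for i in range(t + 1))
        for t in range(len(xs))
    ]


xs = [1.0, 0.0, 2.0, 0.0, 0.0]   # made-up input sequence
decays = [0.8, 0.5]              # echoes the 0.8 / 0.5 channels in the quote

for w in decays:
    seq = recurrent_state(xs, w)
    par = parallel_state(xs, w)
    # Both forms agree up to floating-point rounding.
    assert all(abs(a - b) < 1e-12 for a, b in zip(seq, par))
```

With a data-dependent gate, `w` would change at every step and the closed form above would no longer apply directly, which is the contrast the quote is drawing.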


gwern 1225 days ago

https://twitter.com/arankomatsuzaki/status/16390003799784038...


solomatov 1225 days ago

I don't see why this can't be done with transformers. I'd guess somebody has already tried it.


solomatov 1225 days ago

As far as I remember, back in the RNN era the best models were RNNs with attention. Does this thing have any attention mechanism? If it does, then it has the same problem of O(n^2) computation, where n is the window size. My understanding is that transformers are superior because they are much faster to train and evaluate than RNNs.
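A rough sketch of the O(n^2) concern, counting operations only. It assumes full self-attention scores every query against every key and that a plain RNN does one state update per token; the function names are made up for illustration and per-step costs are ignored.

```python
# Back-of-the-envelope operation counts; function names are hypothetical.

def attention_pairs(n: int) -> int:
    """Query-key score computations for full self-attention over n tokens."""
    return n * n


def recurrent_steps(n: int) -> int:
    """Sequential state updates a plain RNN makes over n tokens."""
    return n


for n in (1_000, 10_000):
    print(n, attention_pairs(n), recurrent_steps(n))
```

Growing the window 10x multiplies the attention count by 100 but the recurrent count by only 10. The RNN's updates must run one after another, though, while the attention scores can all be computed in parallel, which is where the training-speed advantage comes from.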


yieldcrv 1225 days ago

What does RNN stand for?

edit: recurrent neural network
