by aljungberg
1250 days ago
The RWKV model seems really cool. If you could get transformer-like performance with an RNN, the “hard-coded” context length problem might go away. (That said, RNNs famously have infinite context in theory and very short context in practice.) Is there a primer on what RWKV does differently? According to the GitHub page, the key seems to be multiple channels of state with different decay rates, giving, I assume, a combination of short- and long-term memory. But isn’t that what LSTMs were supposed to do too?
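
To make the "different decay rates" idea concrete, here is a toy sketch in Python. It is only my reading of the description, not RWKV's actual formulation: the function name and the decay values are made up for illustration. Each channel is a leaky accumulator, and the decay rate controls how long an old input stays visible in the state.

```python
def run_decay_channels(inputs, decays):
    """Feed a scalar sequence into one leaky accumulator per decay rate.

    Hypothetical illustration only: each channel updates as
    state = decay * state + x, so a decay near 1 forgets slowly.
    """
    state = [0.0] * len(decays)
    for x in inputs:
        state = [d * s + x for d, s in zip(decays, state)]
    return state


# A single impulse at step 0, followed by 49 steps of silence.
impulse = [1.0] + [0.0] * 49
decays = [0.5, 0.9, 0.99]  # assumed values: fast, medium, slow forgetting

final = run_decay_channels(impulse, decays)
for d, s in zip(decays, final):
    print(f"decay={d}: remaining trace of the impulse = {s:.3g}")
```

After 49 quiet steps the fast channel has essentially forgotten the impulse while the slow one still carries most of it, which is the short-term plus long-term mix I am guessing at.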