
by pizza 1226 days ago

I think RWKV ameliorates this to some degree:

How it works: RWKV gathers information into a number of channels, each of which decays at its own speed as you move to the next token. It's very simple once you understand it.

RWKV is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in a usual RNN you can adjust the time-decay of a channel from, say, 0.8 to 0.5 (such adjustable decays are called "gates"), while in RWKV you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect.
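
To make the parallelization point concrete, here is a rough sketch of a single decaying channel. It is a simplification, not RWKV's actual WKV computation: the function names are made up for illustration, and the channel is just a running sum with a fixed decay `w`. Because `w` does not depend on the input, every output position has a closed form that can be computed independently of the others.

```python
def decay_recurrent(xs, w):
    """Sequential form: state[t] = w * state[t-1] + xs[t]."""
    state, out = 0.0, []
    for x in xs:
        state = w * state + x
        out.append(state)
    return out


def decay_closed_form(xs, w):
    """Same values without a running state:
    out[t] = sum over s <= t of w**(t - s) * xs[s].
    Each out[t] depends only on the inputs and the fixed w,
    so all positions could be computed in parallel."""
    return [
        sum(w ** (t - s) * xs[s] for s in range(t + 1))
        for t in range(len(xs))
    ]


# Hypothetical input: one unit of information written at t = 0.
xs = [1.0, 0.0, 0.0, 0.0]

# Writing it to a slow channel (w = 0.8) or a fast channel (w = 0.5)
# changes how long it persists; the decay rates themselves never change.
slow = decay_closed_form(xs, 0.8)
fast = decay_closed_form(xs, 0.5)

# The recurrent and closed forms agree up to floating-point error.
for w in (0.8, 0.5):
    rec = decay_recurrent(xs, w)
    par = decay_closed_form(xs, w)
    assert all(abs(a - b) < 1e-12 for a, b in zip(rec, par))
```

With a data-dependent gate, `w` would change at every step based on the input, and the closed form above would no longer be a fixed set of weights known in advance.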

2 comments

gwern 1226 days ago

https://twitter.com/arankomatsuzaki/status/16390003799784038...


solomatov 1226 days ago

I don't see why this can't be done with transformers. I guess somebody has already tried doing this.
