by algo_trader 999 days ago
I haven't yet fully grokked RWKV.

Just how much compute/memory are we saving here?

My understanding is that a 1B-parameter transformer costs about 2 GFLOPs per token of inference, so about 1 TFLOP for a 500-token sequence (plus several GB of memory).
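To make that arithmetic explicit, here is a back-of-the-envelope sketch. It assumes the usual rule of thumb of roughly 2 FLOPs per parameter per token and ignores the attention cost over the context; the function names and the bytes-per-parameter choices are illustrative, not taken from any library.

```python
def transformer_flops(n_params: int, n_tokens: int) -> int:
    """Rough inference cost, assuming ~2 FLOPs (multiply + add) per parameter per token.

    This ignores the attention term that grows with context length.
    """
    return 2 * n_params * n_tokens


def weight_memory_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Memory for the weights alone (default 2 bytes per parameter, i.e. fp16)."""
    return n_params * bytes_per_param


if __name__ == "__main__":
    n_params = 1_000_000_000  # 1B parameters
    n_tokens = 500            # 500-token sequence

    total = transformer_flops(n_params, n_tokens)
    print(f"total: {total / 1e12:.1f} TFLOP")
    print(f"fp16 weights: {weight_memory_bytes(n_params, 2) / 1e9:.1f} GB")
    print(f"fp32 weights: {weight_memory_bytes(n_params, 4) / 1e9:.1f} GB")
```

Under these assumptions the weights alone take 2 GB in fp16 or 4 GB in fp32, which is in line with the "several GB" figure.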

What would the equivalent be for RWKV? (Let's ignore the inevitable loss penalty, which could be significant.)

1 comments

It's an RNN, so there is no N^2 component over time.

It only requires the previous state.
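A rough sketch of what that means for per-sequence cost, comparing a transformer's growing key/value cache with a fixed-size recurrent state. The layer count, model width, and `vectors_per_layer` knob are hypothetical placeholders for illustration, not RWKV's actual configuration or state layout.

```python
def kv_cache_floats(n_layers: int, d_model: int, n_tokens: int) -> int:
    """Transformer: one key and one value vector per layer per token seen so far."""
    return 2 * n_layers * d_model * n_tokens


def attention_pair_count(n_tokens: int) -> int:
    """Causal attention: token t attends to t positions, so the total is n(n+1)/2."""
    return n_tokens * (n_tokens + 1) // 2


def rnn_state_floats(n_layers: int, d_model: int, vectors_per_layer: int = 1) -> int:
    """RNN: a fixed number of state vectors per layer, independent of sequence length.

    vectors_per_layer is an assumed, illustrative parameter.
    """
    return n_layers * d_model * vectors_per_layer


if __name__ == "__main__":
    n_layers, d_model = 24, 2048  # hypothetical model shape
    for n_tokens in (500, 5000):
        print(
            n_tokens,
            kv_cache_floats(n_layers, d_model, n_tokens),
            attention_pair_count(n_tokens),
            rnn_state_floats(n_layers, d_model),
        )
```

The first two numbers grow with the sequence (linearly and quadratically), while the recurrent state stays the same size however many tokens have been processed.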

(There's a Discord; you should join it with further questions! Unfortunately I'm not as informed as I should be on this one, other than the fact that it is _very_ mobile friendly.) The performance difference is slight and not too bad really, all things considered. And I think it comes out on top for raw efficiency per parameter/FLOP, IIRC.

An interesting concept, for sure! :'DDDD :'))))

Sigh. Do discussions about RWKV always end with suggestions that I join the Discord? If I do join the Discord, will I soon begin suggesting that others join the Discord as well? What I mean is, I've seen this come up a few times on HN and discussions usually end prematurely with suggestions to join the Discord. [0]

If this technique is good, I'll wait until I can learn about it without joining the Discord.

[0]: https://news.ycombinator.com/item?id=35508692