by evolvingstuff 997 days ago
Why would RWKV have a particular advantage in this context? (I may be missing some key intuitions)

RNN-style inference on a small edge controller: all history is cached in a single state per layer, so memory and compute requirements are much lower, IIRC. :')
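To make the "single state per layer" idea concrete, here is a toy sketch. It uses a generic exponential-moving-average recurrence, which is an assumption for illustration and not RWKV's actual update rule; `step`, `run`, `decay`, and `width` are hypothetical names. The point is only that the state stays the same size no matter how many tokens have been seen.

```python
def step(state, x, decay=0.9):
    """One recurrent update: blend the old state with the new input.

    Generic moving-average recurrence for illustration only,
    NOT RWKV's actual formula.
    """
    return [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]


def run(tokens, width=4):
    """Consume a sequence while keeping only a fixed-size state."""
    state = [0.0] * width
    for x in tokens:
        state = step(state, x)  # the old state is overwritten, no history kept
    return state


short = run([[1.0] * 4 for _ in range(10)])
long = run([[1.0] * 4 for _ in range(1000)])

# The state has the same length after 10 tokens and after 1000 tokens.
print(len(short), len(long))
```

A transformer, by contrast, keeps keys and values for every past token, so its cache grows with the sequence.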

Very mobile-device and battery-powered systems friendly. :')))) ;'DDDD

I haven't fully grokked RWKV yet.

Just how much compute/memory are we saving here?

My understanding is that a 1B-parameter transformer needs about 2 GFLOPs per token of inference, so about 1 TFLOP for a 500-token sequence (plus several GB of memory).
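The back-of-the-envelope arithmetic can be spelled out as below. It assumes the common rule of thumb of roughly 2 FLOPs per parameter per token for a forward pass, which ignores the attention term, and it assumes the "several GB" refers to weights stored at 2 or 4 bytes per parameter. All variable names are illustrative.

```python
params = 1_000_000_000        # 1B parameters
flops_per_token = 2 * params  # rule of thumb: ~2 FLOPs per parameter per token
seq_len = 500                 # tokens in the sequence

total_flops = flops_per_token * seq_len

# Weight memory only (assumption: fp16 = 2 bytes, fp32 = 4 bytes per parameter).
weights_gb_fp16 = params * 2 / 1e9
weights_gb_fp32 = params * 4 / 1e9

print(f"FLOPs per token: {flops_per_token:.1e}")
print(f"FLOPs for {seq_len} tokens: {total_flops:.1e}")
print(f"Weights: {weights_gb_fp16} GB (fp16), {weights_gb_fp32} GB (fp32)")
```

This covers only the weight-multiply cost and the weight storage; the per-token attention cost and the key/value cache come on top of it for a transformer.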

What would the equivalent be for RWKV? (Let's ignore the inevitable loss penalty, which could be significant.)

It's an RNN, so there is no O(N^2) component in sequence length.

It only requires the previous state.
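A rough sketch of what "only the previous state" means for memory, compared with a transformer's key/value cache. The model shape (`n_layers=24`, `d_model=2048`) and the count of 5 state vectors per layer are illustrative assumptions, not figures from this thread; both helper functions are hypothetical.

```python
def kv_cache_elements(n_tokens, n_layers, d_model):
    """Transformer: one key and one value vector per layer for every past token."""
    return 2 * n_layers * d_model * n_tokens


def rnn_state_elements(n_layers, d_model, vectors_per_layer=5):
    """RNN: a fixed number of vectors per layer, independent of sequence length.

    vectors_per_layer=5 is an assumed value for illustration.
    """
    return vectors_per_layer * n_layers * d_model


n_layers, d_model = 24, 2048  # hypothetical model shape

for n_tokens in (500, 5000):
    kv = kv_cache_elements(n_tokens, n_layers, d_model)
    st = rnn_state_elements(n_layers, d_model)
    print(f"{n_tokens} tokens: KV cache {kv:,} elements, RNN state {st:,} elements")
```

Under these assumptions the cache grows linearly with the number of tokens while the recurrent state stays constant, which is where the edge-device appeal comes from.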

(There's a Discord; you should join it if you have further questions! Unfortunately I'm not as informed as I should be on this one, other than the fact that it is _very_ mobile friendly.) The performance difference is slight and not too bad, all things considered. And I think it comes out on top for raw efficiency per parameter/FLOP, IIRC.

An interesting concept, for sure! :'DDDD :'))))

Sigh. Do discussions about RWKV always end with suggestions that I join the Discord? If I do join the Discord, will I soon begin suggesting that others join the Discord as well? What I mean is, I've seen this come up a few times on HN and discussions usually end prematurely with suggestions to join the Discord. [0]

If this technique is good, I'll wait until I can learn about it without joining the Discord.

[0]: https://news.ycombinator.com/item?id=35508692