RNN-style inference on a smaller edge controller (all history is cached in a single fixed-size state per layer, so much lower memory and compute requirements, IIRC) :')
Very friendly to mobile devices and battery-powered systems. :')))) ;'DDDD
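To make the memory point concrete, here is a rough sketch comparing a transformer KV cache, which grows with sequence length, against a fixed-size recurrent state. The layer count, width, fp16 storage, and the `vectors_per_layer` figure are all hypothetical placeholders, not numbers from RWKV itself; the real state layout depends on the model version.

```python
# Back-of-the-envelope memory comparison. All dimensions are assumptions
# chosen for illustration, not taken from any specific model.

def kv_cache_bytes(n_layers, d_model, seq_len, bytes_per_value=2):
    # A transformer keeps keys and values for every past token:
    # two (seq_len x d_model) tensors per layer.
    return 2 * n_layers * seq_len * d_model * bytes_per_value

def recurrent_state_bytes(n_layers, d_model, vectors_per_layer=5, bytes_per_value=2):
    # An RNN-style model keeps a fixed number of d_model-sized vectors per
    # layer, regardless of how many tokens it has seen.
    # vectors_per_layer is a hypothetical placeholder.
    return n_layers * vectors_per_layer * d_model * bytes_per_value

N_LAYERS, D_MODEL = 24, 2048  # hypothetical model shape

for seq_len in (1, 500, 4096):
    kv = kv_cache_bytes(N_LAYERS, D_MODEL, seq_len)
    state = recurrent_state_bytes(N_LAYERS, D_MODEL)
    print(f"seq_len={seq_len}: KV cache {kv / 1e6:.1f} MB, recurrent state {state / 1e6:.2f} MB")
```

Under these assumptions the recurrent state stays the same size at every sequence length, while the KV cache scales linearly with the number of tokens.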
My understanding is that a 1B-parameter transformer needs about 2 GFLOPs per generated token, so about 1 TFLOP for a 500-token sequence (plus several GB of memory).
What would the equivalent be for RWKV? (Let's ignore the inevitable loss penalty, which could be significant.)
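For reference, the arithmetic behind that estimate can be sketched as below. It uses the common rule of thumb of roughly 2 FLOPs per parameter per token (one multiply, one add) and ignores the attention cost over the context; the fp16 weight size is an assumption, and the function names are just for illustration.

```python
# Rough FLOP and weight-memory estimate for a dense model.
# Rule of thumb only: ~2 FLOPs per parameter per token, attention over
# the context is ignored.

def forward_flops_per_token(n_params):
    # One multiply and one add per weight for each token processed.
    return 2 * n_params

def sequence_flops(n_params, n_tokens):
    # Total forward-pass FLOPs for a sequence of n_tokens.
    return forward_flops_per_token(n_params) * n_tokens

def weight_bytes(n_params, bytes_per_param=2):
    # Memory for the weights alone; 2 bytes per parameter assumes fp16.
    return n_params * bytes_per_param

n_params = 1_000_000_000  # 1B parameters
print(f"per token:  {forward_flops_per_token(n_params) / 1e9:.0f} GFLOPs")
print(f"500 tokens: {sequence_flops(n_params, 500) / 1e12:.0f} TFLOP")
print(f"weights:    {weight_bytes(n_params) / 1e9:.0f} GB at fp16")
```

Since this per-token figure comes from the weight multiplications alone, a recurrent model with the same parameter count would presumably land in the same ballpark per token; that is an assumption, not something confirmed in this thread.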
(There's a Discord; you should join it with further questions! Unfortunately I'm not as informed as I should be on this one, other than the fact that it is _very_ mobile friendly.) The performance gap is slight and not too bad, really, all things considered. And I think it comes out on top for raw efficiency per parameter/FLOP, IIRC.
Sigh. Do discussions about RWKV always end with suggestions that I join the Discord? If I do join the Discord, will I soon begin suggesting that others join the Discord as well? What I mean is, I've seen this come up a few times on HN and discussions usually end prematurely with suggestions to join the Discord. [0]
If this technique is good, I'll wait until I can learn about it without joining the Discord.