by algo_trader
999 days ago
I haven't fully grokked RWKV yet. Just how much compute/memory are we saving here? My understanding is that a 1B-parameter transformer needs about 2 GFLOPs per token of inference, so about 1 TFLOP for a 500-token sequence (plus several GB of memory). What would the equivalent be for RWKV? (Let's ignore the inevitable loss penalty, which could be significant.)
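For reference, the arithmetic behind those numbers can be sketched as below. This is only a back-of-envelope estimate: it assumes the common rule of thumb of about 2 FLOPs per parameter per token, ignores the attention term that grows with context length, and assumes 16-bit weights (the 2 bytes per parameter is my assumption, not something stated above). The function names are made up for illustration.

```python
def forward_flops(n_params: int, n_tokens: int) -> int:
    """Rule-of-thumb forward cost: ~2 FLOPs per parameter per token.

    Ignores the attention cost that scales with context length.
    """
    return 2 * n_params * n_tokens


def weight_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Memory for the weights alone; 2 bytes/param assumes fp16 or bf16."""
    return n_params * bytes_per_param


N_PARAMS = 1_000_000_000  # the 1B-parameter model from the question
SEQ_LEN = 500             # the 500-token sequence from the question

total_flops = forward_flops(N_PARAMS, SEQ_LEN)
print(f"{total_flops / 1e12:.1f} TFLOP for {SEQ_LEN} tokens")
print(f"{weight_bytes(N_PARAMS) / 1e9:.1f} GB of weights (assumed 16-bit)")
```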
It only requires the previous state, so memory per generated token stays constant instead of growing with the sequence the way a transformer's KV cache does.
(There's a Discord, and you should join it if you have further questions! Unfortunately I'm not as informed on this one as I should be, beyond the fact that it is _very_ mobile friendly.) The performance difference is slight, and not too bad really, all things considered. And I think it comes out on top for raw efficiency per parameter/FLOP, IIRC.
An interesting concept, for sure! :'DDDD :'))))
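To make the "only the previous state" point a bit more concrete, here is a rough sketch comparing how a transformer's KV cache grows with sequence length against a fixed-size recurrent state. Every number in it is an assumption for illustration: the layer count and width are a hypothetical roughly-1B-scale shape, 16-bit values are assumed, and the five state vectors per layer is my assumption about an RWKV-style state layout rather than a figure from this thread.

```python
def kv_cache_bytes(n_layers: int, d_model: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """Transformer KV cache: one key and one value vector per layer per token."""
    return 2 * n_layers * d_model * seq_len * bytes_per_value


def recurrent_state_bytes(n_layers: int, d_model: int,
                          vectors_per_layer: int = 5,
                          bytes_per_value: int = 2) -> int:
    """Fixed-size recurrent state: no dependence on sequence length.

    vectors_per_layer=5 is an assumed, illustrative state layout.
    """
    return n_layers * vectors_per_layer * d_model * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, D_MODEL = 24, 2048

for seq_len in (500, 5_000, 50_000):
    kv = kv_cache_bytes(N_LAYERS, D_MODEL, seq_len)
    state = recurrent_state_bytes(N_LAYERS, D_MODEL)
    print(f"{seq_len:>6} tokens: KV cache {kv / 1e6:8.1f} MB, "
          f"recurrent state {state / 1e6:.2f} MB")
```

Under these assumptions the cache scales linearly with context while the recurrent state stays the same size, which is where the mobile friendliness would come from.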