by jimmyl02
806 days ago
From what I know about RWKV, it's mostly a one-man effort and doesn't have the same data pipeline or resources as most major labs. That's a bit unfortunate, and I'm curious how it would perform given the same training corpus as OpenAI's GPTs. Maybe some labs have tried this internally but haven't released results? On the other hand, it makes sense to invest more money in transformer training runs, since they have been proven to work. RWKV really burst onto the scene and brought RNNs back into a world of transformers. The claim that RWKV isn't parallelizable during training also seems to be refuted in their readme. I'd guess the issue is generalizable performance, as there is a difference between doing well on benchmarks and being usable. Personally, I tried running the weights a long time ago when they were first released and the results weren't usable, but I'm sure there has been considerable progress since then.
RNNs are trivially parallelizable (I've done it myself), as long as you're training them on multiple documents in parallel and have enough memory to hold the state for each document. You just train them 1 token at a time across N documents, instead of the transformer-like N tokens at a time across 1 document.
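
To make that concrete, here is a minimal sketch in plain Python of stepping N documents together, one token position per step. Everything in it is a toy assumption for illustration: the names `rnn_step`, `run_one_document`, and `run_batched`, the per-unit weights, and the use of floats as stand-ins for token embeddings. It only shows the forward pass, not gradient computation, and it is not RWKV's actual recurrence.

```python
import math

HIDDEN = 4
# Hypothetical toy parameters: one recurrent weight and one input weight per unit.
W_STATE = [0.5, -0.3, 0.8, 0.1]
W_IN = [0.2, 0.7, -0.4, 0.9]


def rnn_step(state, token):
    """One recurrent update for one document: h' = tanh(w_state * h + w_in * x)."""
    return [math.tanh(ws * h + wi * token)
            for h, ws, wi in zip(state, W_STATE, W_IN)]


def run_one_document(tokens):
    """Baseline: walk a single document from its first token to its last."""
    state = [0.0] * HIDDEN
    for tok in tokens:
        state = rnn_step(state, tok)
    return state


def run_batched(documents):
    """Advance all documents together, one token position per step.

    Memory cost is one state vector per document, as the comment notes.
    """
    states = [[0.0] * HIDDEN for _ in documents]
    max_len = max(len(doc) for doc in documents)
    for t in range(max_len):
        # The N updates at position t do not depend on each other, so a real
        # framework could run this inner loop as one batched operation.
        # Documents that have already ended keep their final state.
        states = [rnn_step(s, doc[t]) if t < len(doc) else s
                  for s, doc in zip(states, documents)]
    return states


docs = [[1.0, 2.0, 3.0], [0.5, -1.0], [2.0, 0.0, 1.0, -2.0]]
batched = run_batched(docs)
sequential = [run_one_document(doc) for doc in docs]
assert batched == sequential
```

The batched run gives the same final states as processing each document on its own. The time dimension stays sequential, and the parallelism comes entirely from the document dimension.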