by kouteiheika
806 days ago
> The claim that RWKV isn't parallelizable during training also seems to be refuted in their readme.

RNNs are trivially parallelizable (I've done it myself), as long as you're training them on multiple documents in parallel and have enough memory to hold the state for each document. You just train them 1 token at a time across N documents, instead of the transformer-like N tokens at a time across 1 document.
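
To make the "1 token at a time across N documents" layout concrete, here is a minimal sketch in plain Python. It is not RWKV and not anyone's actual training code: the scalar-state toy RNN, the weights `w_h` and `w_x`, and the function names are all made up for illustration, documents are assumed to be the same length, and only the forward pass is shown (no gradients).

```python
import math

def rnn_step(h, x, w_h=0.5, w_x=1.0):
    """One recurrent update for one document: h_t = tanh(w_h*h_{t-1} + w_x*x_t)."""
    return math.tanh(w_h * h + w_x * x)

def run_one_document(tokens):
    """Process a single document strictly left to right; return its final state."""
    h = 0.0
    for x in tokens:
        h = rnn_step(h, x)
    return h

def run_batched(docs):
    """Advance all N documents together, one token position per step.

    The loop over time steps is sequential, but the N updates inside a step
    do not depend on each other, so on real hardware they would be a single
    batched operation. Memory cost: one recurrent state per document.
    """
    states = [0.0] * len(docs)        # hypothetical initial state, one per document
    for t in range(len(docs[0])):     # sequential over time (equal lengths assumed)
        states = [rnn_step(h, doc[t]) for h, doc in zip(states, docs)]
    return states

# Toy "documents": already-embedded tokens as floats (illustrative values only).
docs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
batched = run_batched(docs)
sequential = [run_one_document(d) for d in docs]
```

Both paths apply the same updates in the same order per document, so `batched` and `sequential` should agree; the only difference is that the batched version exposes N independent updates at every time step.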