by VHRanger
806 days ago
Like RWKV and Mamba, this mixes in some RNN properties to avoid the issues transformers have. However, I'm curious about their scaling claims. They have a plot showing how the model scales in training with the FLOPs you throw at it, but what we should really be concerned with is the wall-clock time of training on a fixed amount of hardware. Back in 2018 we could train medium-sized RNNs; the problems were wall-clock training time and training stability.