by anon291
990 days ago
I think many people believe you. The main advantage of transformers over RNNs is training parallelization. RNNs are hard to train because they suffer from vanishing gradients, and because it's hard to get full hardware utilization: the computation is sequential over time steps, so you need large batches to keep the hardware busy. The existence of models like RWKV suggests there may be a future in training like a transformer but running inference like an RNN.
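To make "train like a transformer, infer like an RNN" concrete, here is a minimal sketch using a toy scalar linear recurrence. This is not RWKV's actual formulation; the function names and the `decay` parameter are made up for illustration. The point is only that a linear recurrence can be written two equivalent ways: step by step with a small carried state, or as a sum over past inputs where each position is independent of the other outputs.

```python
def recurrent_form(xs, decay):
    """Inference-style: process one step at a time, carrying a single state value."""
    state = 0.0
    outputs = []
    for x in xs:
        # Each step needs only the previous state and the current input.
        state = decay * state + x
        outputs.append(state)
    return outputs


def parallel_form(xs, decay):
    """Training-style: each output is a weighted sum of the inputs seen so far.

    No output depends on another output, so every position could be
    computed at the same time (here it is just a list comprehension).
    """
    return [
        sum(decay ** (t - s) * xs[s] for s in range(t + 1))
        for t in range(len(xs))
    ]


# Hypothetical toy inputs, chosen so the arithmetic is exact in floating point.
xs = [1.0, 2.0, 3.0, 4.0]
decay = 0.5

seq = recurrent_form(xs, decay)
par = parallel_form(xs, decay)
print(seq)  # [1.0, 2.5, 4.25, 6.125]
print(par)  # [1.0, 2.5, 4.25, 6.125]
```

The two forms agree because the recurrence has no nonlinearity between steps, so it unrolls into a closed-form sum. A classic RNN with a tanh between steps does not unroll this way, which is why it has to be trained sequentially.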