by toxik
716 days ago
Transformers have finite context; RNNs don't. In practice, the RNN's gradient signal is limited by backpropagation through time: it decays with distance. That is in fact the whole selling point of transformers: an association is no harder or easier to learn at short range than at long range. But in theory an RNN can remember infinitely far back.
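To make the decay concrete, here is a rough sketch of a toy case, not any particular model. It assumes a scalar tanh RNN, h_t = tanh(w * h_{t-1} + x), with a hypothetical recurrent weight w = 0.9 and a constant input x = 0.1 (the function name and both values are made up for illustration), and it tracks the magnitude of d h_t / d h_0 as the chain rule multiplies one factor per time step:

```python
import math

def bptt_gradient_magnitudes(w=0.9, steps=50, x=0.1):
    """Return |d h_t / d h_0| for t = 1..steps in a scalar tanh RNN.

    w, steps and x are illustrative assumptions, not values from any real model.
    """
    h = 0.0      # hidden state h_0
    grad = 1.0   # d h_0 / d h_0
    magnitudes = []
    for _ in range(steps):
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2), and |that| <= |w|.
        grad *= w * (1.0 - h * h)
        magnitudes.append(abs(grad))
    return magnitudes

mags = bptt_gradient_magnitudes()
for t in (1, 10, 50):
    print(f"t={t:2d}  |dh_t/dh_0| = {mags[t - 1]:.3e}")
```

With |w| < 1 every factor has magnitude below 1, so the product shrinks at least geometrically in the number of steps. The state itself can still carry information forward indefinitely; it is the training signal for long-range associations that fades.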