by impossiblefork
392 days ago
The authors obviously emphasize that this can make RNNs competitive with big transformers, but it also means you can do things like feed part of a transformer's output back into its input at the next step, or find other ways of turning transformers into RNNs. So RNNs don't have to be all about speed. I think this has every chance of being an enabler for much more powerful architectures. The depth of a transformer is its number of layers. The depth of a transformer with a recurrent connection from the previous token's output to the current input is the number of layers times the number of timesteps so far. If it works as well as I imagine, it's going to make for much more powerful models.
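
To make the depth claim concrete, here is a toy sketch. It is not anything from the paper: `layer` and `run` are hypothetical names, and the "layer" is just a scalar nonlinearity standing in for a real transformer block. It only counts how many layer applications lie on the longest path to each output, with and without feeding the previous output back into the current input.

```python
def layer(x, w):
    # Hypothetical stand-in for a transformer layer: any nonlinear map works here.
    return max(0.0, w * x + 1.0)


def run(tokens, weights, recurrent):
    """Return (outputs, depths).

    depths[t] is the number of layer applications on the longest
    path leading to the output at step t.
    """
    outputs, depths = [], []
    prev_out, prev_depth = 0.0, 0
    for tok in tokens:
        # With recurrence, the previous step's output is added to the input,
        # so its whole computation path is inherited by this step.
        x = tok + (prev_out if recurrent else 0.0)
        depth = prev_depth if recurrent else 0
        for w in weights:
            x = layer(x, w)
            depth += 1
        outputs.append(x)
        depths.append(depth)
        prev_out, prev_depth = x, depth
    return outputs, depths


tokens = [0.5, -1.0, 2.0, 0.25]
weights = [0.9, 1.1, 0.8]  # 3 "layers" (assumed values, purely illustrative)

_, plain = run(tokens, weights, recurrent=False)
_, rec = run(tokens, weights, recurrent=True)
print(plain)  # [3, 3, 3, 3]
print(rec)    # [3, 6, 9, 12]
```

Without the feedback connection every output sits 3 layers deep. With it, the depth at step t is 3 times t, which is the "layers times the timestep" growth described above.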