by impossiblefork 392 days ago
Obviously the authors emphasize that it can make RNNs a competitor for big transformers, but it also means you can do things like feed part of a transformer's output back into its input at the next step, or find other ways of turning transformers into RNNs. So RNNs don't have to be all about speed.

I think this has every chance of being an enabler for much more powerful architectures.

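The depth of a transformer is its number of layers. With a recurrent connection from the previous token's output to the current input, the effective depth at a given timestep is the number of layers times the timestep, since each step stacks another full pass through the layers on top of the previous one.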
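
A rough sketch of that depth argument in Python, just counting the longest chain of sequential layer applications behind the output at a given step. The function name, the `recurrent` flag, and the 1-indexed timestep are my own assumptions for illustration, not anything from the paper, and this ignores attention over earlier positions entirely.

```python
def effective_depth(num_layers: int, timestep: int, recurrent: bool) -> int:
    """Longest chain of sequential layer applications behind the output
    at `timestep` (1-indexed). Hypothetical helper, for illustration only."""
    depth_of_prev_output = 0
    for _ in range(timestep):
        # With the recurrent connection, this step's input already carries
        # the depth of the previous step's output; without it, the input
        # is a fresh token embedding of depth 0.
        depth_in = depth_of_prev_output if recurrent else 0
        # One full pass through the layer stack adds num_layers to the chain.
        depth_of_prev_output = depth_in + num_layers
    return depth_of_prev_output


if __name__ == "__main__":
    for t in (1, 2, 5):
        print(t, effective_depth(12, t, recurrent=False),
              effective_depth(12, t, recurrent=True))
```

So under these assumptions a 12-layer model stays at depth 12 without the feedback connection, while with it the depth grows linearly in the timestep.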

If it works as well as I imagine, it's going to make for much more powerful models.

1 comment

exactly