by ctoa
177 days ago
It's sort of an RNN, but it's also basically a transformer with shared layer weights. Each step is equivalent to one transformer layer, so the computation for n steps is the same as the computation for a transformer with n layers. The notion of a context window applies to the sequence, and the recurrence doesn't really affect it: each iteration sees and attends over the whole sequence.
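To make the "n steps = n layers with shared weights" point concrete, here is a rough sketch in plain Python. It is a heavily simplified layer (single-head attention plus a ReLU feed-forward, no layer norm, no positional or step embeddings), and every name in it (`layer`, `random_weights`, `recurrent_forward`, `stacked_forward`) is made up for illustration rather than taken from the paper or any library.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer(x, w):
    """One simplified transformer layer: attention, then feed-forward, each with a residual."""
    d = len(x[0])
    q, k, v = matmul(x, w["q"]), matmul(x, w["k"]), matmul(x, w["v"])
    # scores[i][j] is how much position i attends to position j.
    # There is no mask, so every position sees the whole sequence.
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k] for qi in q]
    attn = matmul([softmax(r) for r in scores], v)
    h = [[a + b for a, b in zip(xr, ar)] for xr, ar in zip(x, attn)]
    ff = [[max(0.0, u) for u in r] for r in matmul(h, w["ff"])]
    return [[a + b for a, b in zip(hr, fr)] for hr, fr in zip(h, ff)]

def random_weights(d, rng):
    """Hypothetical helper: one set of d x d weight matrices for a layer."""
    return {name: [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(d)]
            for name in ("q", "k", "v", "ff")}

def recurrent_forward(x, shared, n_steps):
    """RNN-style view: apply the same layer n_steps times."""
    for _ in range(n_steps):
        x = layer(x, shared)
    return x

def stacked_forward(x, layers):
    """Transformer-style view: apply a list of layers in order."""
    for w in layers:
        x = layer(x, w)
    return x

rng = random.Random(0)
d, seq_len, n = 4, 5, 3
x = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(seq_len)]
shared = random_weights(d, rng)

out_recurrent = recurrent_forward(x, shared, n)
out_stacked = stacked_forward(x, [shared] * n)  # n "layers" that all share one weight set
print(out_recurrent == out_stacked)
```

The two outputs match because the two loops do the same arithmetic in the same order. The sequence length never changes across iterations, which is why the recurrence depth and the context window are separate axes.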
> UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs.
Very interesting. It seems to be an “old” architecture that is only now being leveraged to a promising extent. I'm curious what made it an active area again (with the work from Samsung and Sapient, and now this one). Perhaps diminishing returns on regular transformers?
0: https://arxiv.org/abs/1807.03819