by trott
623 days ago
> This is no different than a transformer, which, after all, is bound by a finite state, just organized in a different manner.

It's not just a matter of organizing things differently. Suppose your network dimension and sequence length are both X. Then your memory usage (per layer) will be O(X^2), while your training update cost will be O(X^3). That's true for both Transformers and RNNs. However, at the end of the sequence, a Transformer layer can look back and see O(X^2) numbers (X earlier positions, each with X-dimensional keys and values), while an RNN can only see the O(X) numbers in its hidden state.

(this is from the Based paper: https://arxiv.org/pdf/2402.18668)
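
A rough sketch of that counting in Python, with constant factors dropped. The function name and the simplification to one X-by-X weight matrix per layer are assumptions for illustration, not something taken from the paper.

```python
def per_layer_counts(x: int) -> dict:
    """Order-of-magnitude counts for one layer when dimension == sequence length == x.

    Hypothetical helper: constants (e.g. separate key/value/query matrices)
    are dropped, so these are scaling terms rather than exact figures.
    """
    weights = x * x           # one x-by-x weight matrix
    train_cost = x * weights  # x tokens, each pushed through the x-by-x matrix
    return {
        "memory": weights,              # O(X^2) for both architectures
        "train_cost": train_cost,       # O(X^3) for both architectures
        "transformer_visible": x * x,   # x cached positions, each of width x
        "rnn_visible": x,               # a single hidden state of width x
    }


if __name__ == "__main__":
    for x in (64, 1024):
        c = per_layer_counts(x)
        ratio = c["transformer_visible"] // c["rnn_visible"]
        print(f"X={x}: the Transformer layer sees {ratio}x more numbers than the RNN")
```

Memory and training cost match between the two, so the gap in what the last position can read grows linearly with X.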