by slashdave
618 days ago
> the amount of information the model retains about it is bounded by whatever is in its hidden state

This is no different from a transformer, which, after all, is bound by a finite state, just organized in a different manner.
|
It's not just a matter of organizing things differently. Suppose your network dimension and sequence length are both X.
Then your memory usage (per layer) will be O(X^2), while your training update cost will be O(X^3). That's for both Transformers and RNNs.
However, at the end of the sequence, a Transformer layer can look back and see O(X^2) numbers (the keys and values of all X earlier positions, each of dimension X), while an RNN can only see O(X) numbers (its hidden state).
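
To make the counting concrete, here is a rough back-of-the-envelope sketch in Python. The function names are made up for illustration, constant factors are mostly ignored, and it assumes a single layer with one X-by-X weight matrix and a plain key/value cache, so treat it as an order-of-magnitude check rather than a description of any particular model.

```python
# Rough per-layer counts, assuming network dimension == sequence length == X.
# All names here are hypothetical helpers, not part of any library.

def training_cost(seq_len: int, dim: int) -> int:
    """Multiply-adds for one pass: one dim x dim matmul per position."""
    return seq_len * dim * dim


def activation_memory(seq_len: int, dim: int) -> int:
    """Numbers stored for backprop: one dim-sized vector per position."""
    return seq_len * dim


def transformer_visible_numbers(seq_len: int, dim: int) -> int:
    """What the last position can attend to: a key and a value per position."""
    return 2 * seq_len * dim


def rnn_visible_numbers(dim: int) -> int:
    """What the last RNN step can read: only the current hidden state."""
    return dim


if __name__ == "__main__":
    for x in (64, 256, 1024):
        print(
            f"X={x}: cost={training_cost(x, x)}, "
            f"memory={activation_memory(x, x)}, "
            f"transformer sees {transformer_visible_numbers(x, x)}, "
            f"rnn sees {rnn_visible_numbers(x)}"
        )
```

Cost and memory scale the same way for both architectures in this sketch. The only thing that differs is how many numbers are readable at the final step, and the ratio between the two grows linearly in X.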