by immibis
618 days ago
The path is arbitrarily long, not wide. It is possible to build an RNN that remembers the first word of the input, no matter how long the input is. This is not possible with a transformer, so we know they are fundamentally different.
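To make that concrete, here is a rough sketch rather than a real model. `rnn_first_token` is a hypothetical hand-wired recurrence (no learned weights) whose state latches onto the first input, and `windowed_view` is a stand-in for a model that can only see the last `window` tokens. The names and the window size of 512 are assumptions made for illustration.

```python
def rnn_first_token(tokens):
    """Hand-wired recurrence whose state latches onto the first input."""
    stored, gate = 0, 0  # gate == 0 means "nothing stored yet"
    for x in tokens:
        # Gated update: once the gate is 1, the stored value is carried forward unchanged.
        stored = gate * stored + (1 - gate) * x
        gate = 1
    return stored


def windowed_view(tokens, window):
    """The only tokens visible to a model limited to the last `window` positions."""
    return tokens[-window:]


seq = [7] + [0] * 10_000
print(rnn_first_token(seq))           # 7
print(7 in windowed_view(seq, 512))   # False
```

The recurrence uses a fixed amount of state however long `seq` gets, while the windowed view loses the first token as soon as the input outgrows the window.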
What the RNN must be doing reminds me of "sliding window attention": the model learns how to partition its state between short- and long-range memories to minimize overall loss. The two approaches seem related, perhaps even equivalent up to implementation details.
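For anyone who hasn't seen it, this is roughly what a causal sliding-window attention mask looks like. It is a minimal sketch: `sliding_window_mask` is a made-up helper, not any library's API, and it assumes each position may attend to itself and the `window - 1` positions before it.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j."""
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_mask(5, 2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# .xx..
# ..xx.
# ...xx
```

Each row only reaches back a fixed distance, so anything older has to be carried forward some other way, which is the part that resembles an RNN deciding what to keep in its state.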