by cs702
991 days ago
Wouldn't work. Imagine a sequence with 100 tokens, fed to a model with 10 layers, each with a sliding attention window spanning 5 tokens. The top layer's final sliding window can only see 5 trailing tokens, each of which can only see 5 trailing tokens in the previous layer, and so on, for a total of 50 trailing tokens (plus the initial token) of maximum trailing context in the top layer. It's an inherent limitation of this approach.
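
For anyone who wants to check the arithmetic, here is a rough Python sketch that walks the windows back layer by layer. The function name `visible_positions` and the `lookback` parameter are made up for illustration, and it assumes each token sees itself plus `lookback` earlier positions in the layer below. If the 5-token window is instead meant to include the current token, use `lookback=4`.

```python
def visible_positions(seq_len, num_layers, lookback):
    """Input positions that can influence the last token at the top layer.

    Assumption (hypothetical setup, not a specific model): at every layer,
    a token attends to itself and to the `lookback` positions before it.
    """
    reachable = {seq_len - 1}  # start from the final token at the top layer
    for _ in range(num_layers):
        expanded = set()
        for pos in reachable:
            start = max(0, pos - lookback)  # clamp at the start of the sequence
            expanded.update(range(start, pos + 1))
        reachable = expanded  # positions visible one layer further down
    return reachable


positions = visible_positions(seq_len=100, num_layers=10, lookback=5)
# Oldest visible position, newest visible position, number of visible tokens.
print(min(positions), max(positions), len(positions))  # 49 99 51
```

Under that assumption the last token reaches back 10 x 5 = 50 positions, so roughly the first half of a 100-token sequence never influences it, no matter how the attention weights are trained.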
I am having trouble visualizing this.