by cs702
990 days ago
On a first quick pass, this looks so good that I'm wondering if it's too good to be true! But the work looks to be of decent quality and the technique is remarkably straightforward: in each layer, apply attention over the first token and a sliding context window, ignoring everything in between. By implication, each layer must be gradually shifting relevant information forward in the sequence, so that the top layer's final sliding attention window can see it. The only caveat I can think of is that the sliding windows won't be able to shift all important information forward when their combined reach is shorter than the sequence -- for example, when model depth × window length < sequence length, if all windows have the same length.
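To make the caveat concrete, here is a rough sketch in plain Python of the attention pattern as I understand it, not the paper's actual code. The names `sink_window_mask`, `visible_sources`, and `max_lookback` are made up for illustration, and I'm assuming the window counts the current token, so each layer moves information forward by at most window − 1 positions (close to, but slightly less than, the depth × window estimate).

```python
def sink_window_mask(seq_len, window, n_sink=1):
    """Causal boolean mask: query i may attend to key j if j is one of the
    first n_sink tokens, or j is among the last `window` positions up to i.
    (Assumption: the window includes the current token.)"""
    return [
        [j <= i and (j < n_sink or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def visible_sources(mask, depth):
    """Positions whose information can reach the final token after `depth`
    layers, if every layer uses the same mask."""
    n = len(mask)
    frontier = {n - 1}
    for _ in range(depth):
        # One layer back: everything the current frontier attends to.
        frontier = {j for i in frontier for j in range(n) if mask[i][j]}
    return frontier


def max_lookback(depth, window):
    """Farthest distance the sliding windows alone can carry information."""
    return depth * (window - 1)


mask = sink_window_mask(seq_len=16, window=4)
seen = visible_sources(mask, depth=3)
print(sorted(set(range(16)) - seen))  # [1, 2, 3, 4, 5]
print(max_lookback(depth=3, window=4))  # 9
```

With 16 tokens, a window of 4, and 3 layers, the last token can draw on token 0 and on positions 6 through 15, but nothing from positions 1 through 5 can reach it. With 5 layers the reach is 15 and every position is covered.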