by cs702 990 days ago
On a first quick pass, this looks so good that I'm wondering if it's too good to be true!

But the work looks to be of decent quality and the technique is remarkably straightforward:

The idea is to apply attention over the first token and a sliding context window, ignoring everything in-between, in each layer.
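To make the pattern concrete, here is a minimal sketch of the attention mask described above, in plain Python. The function names, the `num_sink` parameter, and the convention that the window includes the current token are my own assumptions for illustration, not taken from the paper or its code.

```python
def sink_window_mask(seq_len, window, num_sink=1):
    """Build a boolean mask: mask[q][k] is True when query q may attend to key k.

    Assumptions (illustrative only): attention is causal, the first
    `num_sink` tokens are always visible, and the sliding window covers
    the `window` most recent positions including the query itself.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q              # never look at future tokens
            is_sink = k < num_sink       # the always-visible first token(s)
            in_window = q - k < window   # the recent sliding window
            row.append(causal and (is_sink or in_window))
        mask.append(row)
    return mask


def visible_keys(mask, q):
    """List the key positions that query position q can attend to."""
    return [k for k, allowed in enumerate(mask[q]) if allowed]


mask = sink_window_mask(seq_len=10, window=3)
print(visible_keys(mask, 9))  # the first token plus the last three positions
```

Everything between the first token and the window is masked out, which is the "ignoring everything in-between" part.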

By implication, each layer must be gradually shifting relevant information forward in the sequence, so that the final sliding window of the top layer can see it.

The only caveat I can think of is that the sliding windows can't shift all important information forward when their combined reach is shorter than the sequence, for example when model depth × window length < sequence length, assuming all windows have the same length.


Can't wait for a GitHub repo adapting the method!
The end of the sequence could be padded with constant "neutral" values?
Wouldn't work. Imagine a sequence with 100 tokens, fed to a model with 10 layers, each with a sliding attention window spanning 5 tokens. The top layer's final sliding window can only see 5 trailing tokens, each of which can only see 5 trailing tokens in the previous layer, and so on, for a maximum of about 50 trailing tokens (plus the initial token) of context in the top layer.

It's an inherent limitation of this approach.
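A rough way to check the arithmetic for the 100-token, 10-layer, 5-token-window example is to walk the reach backward one layer at a time. This is only a sketch under my own assumptions: the function names are made up, the window is taken to include the current token, and the always-visible first token is left out of the count.

```python
def comment_bound(depth, window):
    """The back-of-the-envelope bound used above: depth x window length."""
    return depth * window


def trailing_context(depth, window, seq_len):
    """Count the trailing positions the last token of the top layer can draw on.

    Assumes each layer's window covers `window` positions including the
    current one, so every layer extends the reach back by window - 1.
    The always-visible first token is not counted here.
    """
    last = seq_len - 1
    earliest = last
    for _ in range(depth):
        earliest = max(0, earliest - (window - 1))
    return last - earliest + 1


print(comment_bound(10, 5))          # 50
print(trailing_context(10, 5, 100))  # 41
```

If the window counts the current token, the exact reach comes out a bit below depth × window, but either way it is well short of 100 tokens, so the conclusion is the same: anything earlier than that (other than the first token) never reaches the final position.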

How about neutral value padding at the other end?

I am having trouble visualizing this.