by cs702 991 days ago
Wouldn't work. Imagine a sequence with 100 tokens, fed to a model with 10 layers, each with a sliding attention window spanning 5 tokens. The top layer's final sliding window can only see 5 trailing tokens, each of which can only see 5 trailing tokens in the previous layer, and so on down the stack, for a total of 10 × 5 = 50 trailing tokens (plus the initial token) of maximum trailing context in the top layer. The other tokens in the 100-token sequence never reach it.

It's an inherent limitation of this approach.
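
For anyone who wants to check the arithmetic, here is a rough sketch that walks the receptive field back through the layers. It assumes each position attends to itself plus the `window` positions before it, and it treats "the initial token" as an always-visible position 0 when `keep_first=True`. Both are my reading of the comment, and the function name and parameters are made up for illustration, not taken from any library.

```python
def visible_positions(seq_len, num_layers, window, keep_first=False):
    """Return the input positions that can influence the last position
    of the top layer, under the assumptions stated above."""
    reachable = {seq_len - 1}  # start from the final token in the top layer
    for _ in range(num_layers):
        below = set()
        for pos in reachable:
            lo = max(0, pos - window)
            below.update(range(lo, pos + 1))  # itself plus `window` trailing tokens
            if keep_first:
                below.add(0)  # assumption: the initial token is always attended to
        reachable = below
    return reachable


seen = visible_positions(seq_len=100, num_layers=10, window=5)
print(len(seen) - 1, min(seen))  # 50 49: 50 trailing tokens, earliest is position 49
```

With `keep_first=True` the result is the same 51 positions plus position 0, so positions 1 through 48 stay invisible to the final token either way.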

1 comment

How about neutral value padding at the other end?

I am having trouble visualizing this.