by Nevermark 994 days ago
Could the end of the sequence be padded with constant "neutral" values?

Wouldn't work. Imagine a sequence with 100 tokens, fed to a model with 10 layers, each with a sliding attention window spanning 5 tokens. The top layer's final sliding window can only see the 5 trailing tokens, each of which can only see 5 trailing tokens in the previous layer, and so on, for a maximum of 10 × 5 = 50 trailing tokens (plus the initial token) of context in the top layer.

It's an inherent limitation of this approach.
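To make the arithmetic concrete, here's a rough sketch that just walks the attention pattern layer by layer and counts which earlier positions can reach the last token. It is a toy reachability count, not any real model's code: the function name `trailing_context` is made up, and it assumes each position attends to itself plus the `window` positions before it, and ignores the initial token.

```python
def trailing_context(seq_len, num_layers, window):
    """Count earlier positions that can influence the last position at the top layer.

    Assumption (hypothetical setup): every position attends to itself and to the
    `window` positions immediately before it, in every layer.
    """
    last = seq_len - 1
    reachable = {last}  # positions whose values can flow into `last`
    for _ in range(num_layers):
        expanded = set()
        for pos in reachable:
            lo = max(0, pos - window)  # clamp at the start of the sequence
            expanded.update(range(lo, pos + 1))
        reachable = expanded
    return len(reachable) - 1  # exclude the last position itself


if __name__ == "__main__":
    print(trailing_context(seq_len=100, num_layers=10, window=5))
    # Padding the end only moves the last position; the reach stays the same size.
    print(trailing_context(seq_len=120, num_layers=10, window=5))
```

Under those assumptions the reach grows by `window` per layer, so it is bounded by `num_layers * window` no matter how long the sequence is or what gets appended to it.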

How about neutral value padding at the other end?

I am having trouble visualizing this.