

	by cs702 991 days ago
	Wouldn't work. Imagine a sequence with 100 tokens, fed to a model with 10 layers, each with a sliding attention window spanning 5 tokens. The top layer's final sliding window can only see 5 trailing tokens, each of which can only see 5 trailing tokens in the previous layer, and so on, for a total of 50 trailing tokens (plus the initial token) of maximum trailing context in the top layer. It's an inherent limitation of this approach.
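
	The arithmetic above can be checked with a rough sketch. It assumes each position attends to itself plus the 5 positions immediately before it (one reading of "5 trailing tokens"), and it does not model the separate "initial token". The function name and parameters are made up for illustration.

```python
def reachable_positions(n_tokens, n_layers, window):
    """Input positions that can influence the last token's top-layer state.

    Assumption: every position attends to itself and to the `window`
    positions immediately before it, in every layer.
    """
    reachable = {n_tokens - 1}  # start from the last token in the top layer
    for _ in range(n_layers):
        expanded = set()
        for pos in reachable:
            lo = max(0, pos - window)  # clamp at the start of the sequence
            expanded.update(range(lo, pos + 1))
        reachable = expanded  # positions visible one layer further down
    return reachable


positions = reachable_positions(n_tokens=100, n_layers=10, window=5)
trailing = len(positions) - 1  # exclude the last token itself
print(trailing, min(positions))  # 50 49
```

	Under this reading the last token sees 50 trailing tokens, reaching back only to position 49 out of 100. If the 5-token window is instead taken to include the current token, passing window=4 gives 40 trailing tokens.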

1 comment

Nevermark 991 days ago

How about neutral value padding at the other end?

I am having trouble visualizing this.