


	by sabaimran 318 days ago
	> Despite being sparse, NSA surpasses Full Attention baseline on average across general benchmarks, long-context tasks, and reasoning evaluation.

	Isn't it notable that the latency improvement came with no performance loss? I'm not super familiar with all the technical aspects, but that seems like it should be one of the main focuses of the paper.

2 comments

ethan_smith 318 days ago

That performance holds up (or even improves) isn't surprising: sparse attention can reduce noise by focusing only on relevant tokens. Full attention gives every token some softmax weight, including irrelevant ones, which can dilute focus, while NSA's sparse selection is loosely analogous to how humans selectively process information.
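
To make the noise-reduction intuition concrete, here is a minimal sketch comparing dense softmax weights with a simple top-k variant. This is not NSA's actual mechanism: top-k selection is a swapped-in simplification, and the function names, the toy `query`/`keys` values, and the choice of `k` are all assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_scores(query, keys):
    # Scaled dot-product score between one query and each key.
    scale = math.sqrt(len(query))
    return [sum(q * k for q, k in zip(query, key)) / scale for key in keys]

def full_attention_weights(query, keys):
    # Dense attention: every key receives a nonzero weight.
    return softmax(scaled_scores(query, keys))

def topk_attention_weights(query, keys, k):
    # Hypothetical sparse variant: keep the k highest-scoring keys,
    # renormalize over them, and give every other key zero weight.
    scores = scaled_scores(query, keys)
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    kept = softmax([scores[i] for i in keep])
    weights = [0.0] * len(keys)
    for i, w in zip(keep, kept):
        weights[i] = w
    return weights

# Toy data (assumed): two keys aligned with the query, three near-irrelevant ones.
query = [1.0, 0.0]
keys = [[4.0, 0.0], [3.0, 0.0], [0.1, 0.0], [-0.1, 0.0], [0.0, 0.0]]

full = full_attention_weights(query, keys)
sparse = topk_attention_weights(query, keys, k=2)
print("full:  ", [round(w, 3) for w in full])
print("sparse:", [round(w, 3) for w in sparse])
```

In the dense case the three low-scoring keys still take a share of the weight. In the top-k case that share is redistributed to the kept keys, which is the "less noise" effect described above. Whether this helps a real model depends on how well the selection step picks relevant tokens.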


laughingcurve 317 days ago

Yes, that’s what makes it so interesting and novel. You nailed it.
