| HN Mirror

	by ethan_smith 322 days ago
	Attention weights can still assign non-zero probability to irrelevant tokens, since the mechanism is optimized for prediction rather than semantic relevance. Those irrelevant tokens can then interfere with the hidden state representations.
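
A rough sketch of the point, assuming plain softmax attention over a single query. The scores and value vectors are made-up toy numbers, and `softmax` and `attend` are hypothetical helpers, not taken from any particular library. Token 0 plays the "relevant" token and tokens 1 to 3 the "irrelevant" ones:

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; exp() is always > 0,
    # so every token ends up with a strictly positive weight.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores, values):
    # The output is a weighted sum of the value vectors, so any token with
    # non-zero weight contributes something to the result.
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Toy numbers (assumed for illustration only).
scores = [4.0, 0.0, 0.0, 0.0]            # token 0 scores much higher
values = [[1.0, 0.0],                    # "relevant" token points along axis 0
          [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]  # "irrelevant" tokens along axis 1

weights, output = attend(scores, values)
print(weights)
print(output)
```

Even with a large score gap, the three low-scoring tokens each keep a small positive weight, so the second coordinate of `output` is not zero. With more irrelevant tokens in the context, that leaked mass adds up.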