	by schopra909 620 days ago
	Quite the opposite: if you have a long sequence, only a smattering of the words will influence the meaning of the current word. Everything else is “noise”. Attention is really good at finding this smattering of words (i.e., assigning most of the weight there). But it struggles to put exactly 0 on the other words.
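
A minimal sketch of the point above, using made-up scores rather than anything from a real model: softmax exponentiates every score, so mathematically each weight is strictly positive, and the leftover weight on irrelevant tokens adds up as the sequence gets longer. The helper name `noise_mass` and the score values (6.0 and 5.5 for two relevant tokens, 0.0 for each noise token) are assumptions chosen for illustration.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def noise_mass(n_noise):
    # Two "relevant" tokens with high scores, then n_noise irrelevant
    # tokens with score 0.0 (hypothetical values, for illustration only).
    weights = softmax([6.0, 5.5] + [0.0] * n_noise)
    return sum(weights[2:])

short_weights = softmax([6.0, 5.5] + [0.0] * 6)
print(min(short_weights) > 0.0)   # no weight is exactly zero here
print(noise_mass(6))              # small share of weight on noise
print(noise_mass(600))            # much larger share with a long sequence
```

With only a handful of noise tokens the wasted weight is small, but with hundreds of them it becomes a large fraction of the total, even though each individual noise weight is tiny.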

2 comments

dartos 620 days ago

I mean, wouldn’t it be unlikely that SoftmaxA[n] - SoftmaxB[n] is exactly 0?

Even if two attention layers learn two different things, I would imagine the corresponding weights in each layer wouldn’t exactly cancel each other out.
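
A rough sketch of that question, again with invented scores: two softmax maps (called `softmax_a` and `softmax_b` here, after the SoftmaxA/SoftmaxB notation above) that disagree on the relevant tokens but see similar scores on the noise tokens. This is plain subtraction as written in the comment, with no learned scaling, so it is an illustration rather than any particular model's method.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores: positions 0 and 1 are "relevant", 2..4 are noise.
scores_a = [4.0, 1.0, 0.20, 0.10, -0.20]
scores_b = [1.0, 4.0, 0.25, 0.05, -0.15]

softmax_a = softmax(scores_a)
softmax_b = softmax(scores_b)

# Element-wise difference SoftmaxA[n] - SoftmaxB[n].
diff = [a - b for a, b in zip(softmax_a, softmax_b)]

for n, (a, d) in enumerate(zip(softmax_a, diff)):
    print(n, round(a, 5), round(d, 5))
```

In this toy case the differences at the noise positions are not exactly 0, which matches the intuition in the comment, but they are much smaller in magnitude than the original noise weights in `softmax_a`. The difference can also go negative, so it is no longer a probability distribution.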

absoflutely 620 days ago

why say lot word when few word do

dartos 620 days ago

Few word no do tho

kridsdale3 620 days ago

🫥

1024core 620 days ago

Phew!
