| HN Mirror



	by stellalo 660 days ago
	Looks like this makes (at least initially in training) the “negative” attention term smaller in the early layers (smaller l) than in later layers (larger l). Which I guess makes sense: you probably want to attend a little bit to everything before concluding that it’s really only a few spots you should look at. (Although it seems the authors do not discuss this choice anywhere in the paper?)
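
	To make the layer dependence concrete, here is a rough sketch. It assumes the initialization under discussion has the form lambda_init(l) = 0.8 - 0.6 * exp(-0.3 * (l - 1)), with l the 1-based layer index; the constants are quoted from memory and are an assumption, not something stated in this thread, and the function name is made up for illustration.

```python
import math

def lambda_init(layer: int) -> float:
    """Assumed initial weight of the subtracted ("negative") attention map.

    layer is the 1-based layer index l. The constants 0.8, 0.6 and 0.3
    are an assumption about the schedule, not taken from this thread.
    """
    return 0.8 - 0.6 * math.exp(-0.3 * (layer - 1))

# Print the assumed initial weight for a few layer depths.
for layer in (1, 2, 4, 8, 16):
    print(layer, round(lambda_init(layer), 3))
```

	Under that assumption the weight starts at 0.2 in the first layer and rises monotonically towards 0.8 with depth, so early layers subtract less of the second attention map than later ones, at least before training moves the value.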