by stellalo 613 days ago
Looks like this makes the “negative” attention term smaller in the early layers (smaller l) than in the later layers (larger l), at least at the start of training. Which I guess makes sense: you probably want to attend a little bit to everything before concluding that it’s really just a few spots you should look at.

(Although it seems the authors do not discuss this choice anywhere in the paper?)
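
To make the trend concrete, here is a rough sketch. It assumes the paper under discussion is the Differential Transformer one and that the schedule in question is its layer-dependent initialization λ_init = 0.8 − 0.6·exp(−0.3·(l − 1)), with l counted from 1; the comment itself does not quote the formula, so treat that as my reading. The function name `lambda_init` is made up for illustration.

```python
import math


def lambda_init(layer: int) -> float:
    """Initial weight on the subtracted ("negative") attention map.

    Assumed schedule: 0.8 - 0.6 * exp(-0.3 * (layer - 1)), layer >= 1.
    """
    return 0.8 - 0.6 * math.exp(-0.3 * (layer - 1))


if __name__ == "__main__":
    # Print the initial value for a few layer indices to show the growth.
    for layer in (1, 2, 4, 8, 16, 32):
        print(f"l={layer:2d}  lambda_init={lambda_init(layer):.3f}")
```

Under that assumption the value starts at 0.2 in the first layer and climbs toward 0.8 with depth, so early layers subtract less of the second attention map at initialization.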