by stellalo
613 days ago
Looks like this makes (at least initially in training) the “negative” attention term smaller in the early layers (smaller l) than in later layers (larger l). Which I guess makes sense: you probably want to attend a little bit to everything before concluding that it’s really a few spots you should look at. (Although it seems the authors do not discuss this choice anywhere in the paper?)