Hmm, 0.8 works well, but let's try setting lower layers to lower initial value. Let's say 0.2. Ok, I need a formula that will go between 0.2 and 0.8, slowly approaching 0.8. Starts fiddling with numbers for 20min, I guess this can work.
In practice there's always a trade off between getting some result out and published and rigorously exploring every avenue of optimisation in research. Sometimes you have to say 'this is good enough and long enough already'.
Right, but it has to be *good enough*. My issue isn't that they didn't do more work, my issue is that they didn't even report work that they did do that communicates the impact of the literal proposed method.
Looks like this makes (at least initially in training) the “negative” attention term smaller in the early layers (smaller l) compared to later layers (larger l). Which I guess makes sense: you probably want to attend a little bit to everything before concluding that it’s really a few spots you should look at.
(Although it seems the author do not discuss this choice anywhere in the paper?)