by WithinReason 613 days ago
We empirically find that the setting λᵢₙᵢₜ = 0.8 − 0.6 × exp(−0.3 · (l − 1)) works well in practice

I wonder about the story behind that formula...

3 comments

Hmm, 0.8 works well, but let's try setting lower layers to a lower initial value. Let's say 0.2. OK, I need a formula that goes from 0.2 to 0.8, slowly approaching 0.8. *Starts fiddling with numbers for 20 min.* I guess this can work.
Sure, but in research you should show some comparisons.
In practice there's always a trade-off in research between getting some result out and published and rigorously exploring every avenue of optimisation. Sometimes you have to say 'this is good enough and long enough already'.
Right, but it has to be *good enough*. My issue isn't that they didn't do more work, my issue is that they didn't even report work that they did do that communicates the impact of the literal proposed method.

https://news.ycombinator.com/item?id=41783013

A whole lot of things are tuned optimally by rotating an analog dial until things look / sound right.
Looks like this makes (at least initially in training) the “negative” attention term smaller in the early layers (smaller l) compared to later layers (larger l). Which I guess makes sense: you probably want to attend a little bit to everything before concluding that it’s really a few spots you should look at.

(Although it seems the authors do not discuss this choice anywhere in the paper?)
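
For anyone who wants to see the layer dependence concretely, here is a minimal sketch that just evaluates the quoted formula. It assumes l is a 1-indexed layer number (so the first layer gets 0.8 − 0.6 = 0.2); the function name `lambda_init` is made up for illustration and is not taken from the paper's code.

```python
import math


def lambda_init(layer: int) -> float:
    """Evaluate 0.8 - 0.6 * exp(-0.3 * (l - 1)) for a 1-indexed layer l.

    The 1-indexing is an assumption based on the (l - 1) term in the formula.
    """
    return 0.8 - 0.6 * math.exp(-0.3 * (layer - 1))


# Print the initial value for a few layer depths.
for layer in (1, 2, 4, 8, 16, 32):
    print(f"l={layer:2d}  lambda_init={lambda_init(layer):.3f}")
```

The value starts at 0.2 for the first layer and rises monotonically toward 0.8 without ever reaching it, which matches the "between 0.2 and 0.8, slowly approaching 0.8" reading above.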