| HN Mirror


	by WithinReason 660 days ago
	"We empirically find that the setting λᵢₙᵢₜ = 0.8 − 0.6 × exp(−0.3 · (l − 1)) works well in practice." I wonder about the story behind that formula...

3 comments

Kubuxu 660 days ago

Hmm, 0.8 works well, but let's try setting lower layers to lower initial value. Let's say 0.2. Ok, I need a formula that will go between 0.2 and 0.8, slowly approaching 0.8. Starts fiddling with numbers for 20min, I guess this can work.
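
For anyone who wants to check the 0.2-to-0.8 range, here is a minimal sketch that just evaluates the quoted formula. It assumes l is a 1-based layer index; the function name `lambda_init` and the sample layer numbers are mine for illustration, not taken from the paper's code.

```python
import math


def lambda_init(layer: int) -> float:
    """Evaluate the quoted setting 0.8 - 0.6 * exp(-0.3 * (l - 1)).

    Assumption: `layer` is the 1-based layer index l.
    """
    return 0.8 - 0.6 * math.exp(-0.3 * (layer - 1))


if __name__ == "__main__":
    # Illustrative layer indices only; real model depth is not given in the thread.
    for layer in (1, 2, 4, 8, 16, 32):
        print(f"l={layer:2d}  lambda_init={lambda_init(layer):.4f}")
```

At l = 1 the exponential is exp(0) = 1, so the value is 0.8 − 0.6 = 0.2. As l grows the exponential decays toward 0, so the value rises monotonically toward 0.8 without reaching it.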

godelski 660 days ago

Sure, but in research you should show some comparisons.

physicsguy 659 days ago

In practice there's always a trade-off in research between getting some result out and published and rigorously exploring every avenue of optimisation. Sometimes you have to say 'this is good enough and long enough already'.

godelski 659 days ago

Right, but it has to be *good enough*. My issue isn't that they didn't do more work, my issue is that they didn't even report work that they did do that communicates the impact of the literal proposed method.

https://news.ycombinator.com/item?id=41783013

kridsdale3 660 days ago

A whole lot of things are tuned optimally by rotating an analog dial until things look / sound right.

stellalo 660 days ago

Looks like this makes (at least initially in training) the “negative” attention term smaller in the early layers (smaller l) compared to later layers (larger l). Which I guess makes sense: you probably want to attend a little bit to everything before concluding that it’s really a few spots you should look at.

(Although it seems the authors do not discuss this choice anywhere in the paper?)
