by magicalhippo
614 days ago
An interesting aspect is that they don't do a plain subtraction, but rather subtract a scaled portion of the second softmax. This makes sense: if the two copies were identical, the softmax outputs would be identical and a plain difference would be zero everywhere. By subtracting a scaled copy instead, the normalization of the difference seems to really boost the signal value(s) over the "noise", making the signal stand out compared to pre-normalization.
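To make that concrete, here is a rough NumPy sketch of the idea as I understand it, not the paper's actual implementation. The function names, the toy logits, the scale `lam = 0.5`, and the "divide by the sum" normalization are all my own assumptions for illustration; the real model may normalize differently.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def scaled_difference(logits_a, logits_b, lam):
    # Subtract a scaled copy of the second softmax from the first.
    # `lam` is a hypothetical name for the scaling factor.
    return softmax(logits_a) - lam * softmax(logits_b)

# Toy logits (made up): position 0 is the "signal", the rest is "noise".
logits_a = np.array([2.0, 0.0, 0.0, 0.0])
logits_b = np.array([1.0, 0.0, 0.0, 0.0])

# Identical inputs with a plain subtraction (lam = 1): everything cancels.
plain_identical = scaled_difference(logits_a, logits_a, lam=1.0)

# Different inputs with a scaled subtraction: most of the noise cancels,
# while a good part of the signal survives.
before = softmax(logits_a)
diff = scaled_difference(logits_a, logits_b, lam=0.5)

# Stand-in normalization (an assumption): rescale the difference to sum to 1.
after = diff / diff.sum()

print("plain difference of identical copies:", plain_identical)
print("signal share before:", before[0])
print("signal share after: ", after[0])
```

In this toy case the signal's share of the total goes up after the scaled subtraction and renormalization, which matches the effect described above. Different logits or a different `lam` will give different results, and some choices can make entries of the difference negative.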