by jszymborski
615 days ago
Your comment brings up two variants that could be interesting if your goal is to increase the sparsity of the attention:

- Rectify the difference of the softmaxes: max(0, s(A1) - lambda s(A2))
- Apply the Heaviside function to the second softmax: s(A1) - lambda H(s(A1) - lambda s(A2))

The second one is a bit more drastic and maybe harder to train.
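For concreteness, here is a rough sketch of both variants on a single row of attention scores. It is only an illustration, not anyone's actual implementation. It assumes s is the usual softmax, that "rectify" means ReLU, i.e. max(0, x), and that H(0) = 0. The function names and the example values of A1, A2 and lambda are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rectified_diff(a1, a2, lam):
    """Variant 1 (assumed ReLU): max(0, s(A1) - lambda * s(A2))."""
    s1, s2 = softmax(a1), softmax(a2)
    return [max(0.0, p - lam * q) for p, q in zip(s1, s2)]


def heaviside_diff(a1, a2, lam):
    """Variant 2: s(A1) - lambda * H(s(A1) - lambda * s(A2)).

    H is the Heaviside step; H(0) = 0 is an assumption made here.
    """
    s1, s2 = softmax(a1), softmax(a2)
    return [p - lam * (1.0 if p - lam * q > 0 else 0.0) for p, q in zip(s1, s2)]


# Hypothetical scores: A1 favours position 0, A2 favours position 1.
a1 = [2.0, 0.0, 0.0]
a2 = [0.0, 2.0, 0.0]
lam = 0.5

rect = rectified_diff(a1, a2, lam)
heav = heaviside_diff(a1, a2, lam)
print(rect)
print(heav)
```

In this toy example the rectified variant zeroes out position 1 exactly, which is where the sparsity comes from. The Heaviside variant subtracts a hard constant wherever the difference is positive, and the step function has zero gradient almost everywhere, which fits the guess that it would be harder to train.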