by espadrine
617 days ago
It is a neat approach, but IIUC it comes with a tradeoff: doubling the key heads. I wonder whether a different approach exists without that issue. For instance, one could use max(0, exp(x)-1) instead of exp(x) in the softmax attention formula. That way, when the query is orthogonal to the key (or worse), that key does not contribute.

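A rough sketch of what that substitution could look like, in plain Python. This is not code from the comment: the function names and the `eps` guard are made up for illustration, and `x` is assumed to be the raw query-key score.

```python
import math

def softmax_weights(scores):
    # Standard softmax: every key gets a strictly positive weight.
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relu_expm1_weights(scores, eps=1e-9):
    # Hypothetical variant: replace exp(x) with max(0, exp(x) - 1).
    # Scores <= 0 (query orthogonal or opposed to the key) get exactly zero weight.
    # eps is an assumption here, to avoid dividing by zero when no score is positive.
    vals = [max(0.0, math.expm1(s)) for s in scores]
    total = sum(vals) + eps
    return [v / total for v in vals]

scores = [2.0, 0.0, -1.0]
print(softmax_weights(scores))     # all three weights are positive
print(relu_expm1_weights(scores))  # only the first weight is nonzero
```

One caveat: unlike exp(x), the modified function is not shift-invariant, so the usual subtract-the-max trick does not carry over, and very large scores would overflow in this naive form.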
Won't this cause the gradient to vanish on the left half, causing problems with training?
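To make the concern concrete, here is a small sketch that compares finite-difference gradients of max(0, exp(x)-1) and exp(x). The helper names are hypothetical, and this only looks at the elementwise function, not a full attention layer.

```python
import math

def relu_expm1(x):
    # The proposed replacement for exp(x): max(0, exp(x) - 1).
    return max(0.0, math.expm1(x))

def numerical_grad(f, x, h=1e-6):
    # Central finite difference, used as a stand-in for autograd.
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-2.0, -0.5, 0.5, 2.0):
    g_new = numerical_grad(relu_expm1, x)
    g_exp = numerical_grad(math.exp, x)
    print(f"x={x:+.1f}  d/dx max(0, exp(x)-1) = {g_new:.4f}  d/dx exp(x) = {g_exp:.4f}")
```

For x < 0 the modified function is flat, so its gradient is exactly zero there, much like a dead ReLU unit, while exp(x) keeps a small positive gradient. Whether that actually hurts training in practice is not something this sketch settles.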