by espadrine 617 days ago
It is a neat approach, but one that comes with a tradeoff, IIUC: doubling the key heads.

I wonder if a different approach without that issue exists. For instance, using max(0, exp(x)-1) instead of exp(x) in the softmax attention formula. That way, when the query is orthogonal to the key (or worse, points away from it), that key does not contribute.
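
To make the suggestion concrete, here is a minimal sketch in plain Python that compares standard softmax weights with weights built from max(0, exp(x)-1). The function names, the normalization by the sum, and the small `eps` guard are my own assumptions for illustration; the comment above only proposes the kernel itself.

```python
import math

def softmax_weights(scores):
    """Standard softmax: every key gets a strictly positive weight."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shifted_weights(scores, eps=1e-9):
    """Variant using max(0, exp(x) - 1) in place of exp(x).

    Assumption: the weights are still normalized by their sum. eps guards
    the case where every score is <= 0 and the sum would be zero.
    """
    kernel = [max(0.0, math.expm1(s)) for s in scores]
    total = sum(kernel)
    return [k / (total + eps) for k in kernel]

# Query-key dot products: aligned, orthogonal, opposed.
scores = [2.0, 0.0, -1.0]
print(softmax_weights(scores))  # all three keys receive some weight
print(shifted_weights(scores))  # only the aligned key receives weight
```

One thing this sketch makes visible: unlike exp(x), the shifted kernel is not invariant to subtracting a constant from all scores, so the usual max-subtraction trick does not carry over directly.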

1 comment

> using max(0, exp(x)-1) instead of exp(x)

Won't this cause the gradient to vanish on the left half, causing problems with training?

That concern is shared with ReLU. But since the weights are shared across the context/minibatch, perhaps it would not be an issue in practice, just as with ReLU.
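
A rough sketch of the gradient argument, in plain Python. It looks only at the derivative of the unnormalized kernel max(0, exp(x)-1) with respect to the score and ignores the normalization step; the scores are made-up values standing in for one key scored against several queries in a context. The derivative at exactly x = 0 is set to 0 by convention, as is commonly done for ReLU.

```python
import math

def kernel(x):
    """Unnormalized attention kernel: max(0, exp(x) - 1)."""
    return max(0.0, math.expm1(x))

def kernel_grad(x):
    """Derivative of kernel: exp(x) for x > 0, and 0 for x <= 0."""
    return math.exp(x) if x > 0 else 0.0

# Hypothetical scores of one key against four different queries.
scores = [-1.5, -0.2, 0.7, 2.0]
grads = [kernel_grad(s) for s in scores]

# Negative scores pass no gradient, positive scores pass exp(score).
print(grads)

# The projection weights are shared across all positions, so they still
# receive a signal as long as at least one score is positive.
print(sum(grads) > 0)
```

This mirrors the ReLU situation: a unit is only stuck if it is inactive for every example it sees, not just for some of them.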