HN Mirror



	by espadrine 617 days ago
	It is a neat approach, but one that comes with a tradeoff, IIUC: doubling the key heads. I wonder whether a different approach exists that avoids that cost. For instance, using max(0, exp(x)-1) instead of exp(x) in the softmax attention formula, where x is the query-key score. That way, when the query is orthogonal to the key (x = 0) or points away from it (x < 0), that key contributes nothing.
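To make the suggestion concrete, here is a minimal sketch in plain Python comparing ordinary softmax weights with weights built from max(0, exp(x)-1). The function names, the `eps` term, and the example scores are assumptions for illustration, not something from the comment; in particular, the comment does not say how to normalize when every score is non-positive.

```python
import math

def softmax_weights(scores):
    """Standard softmax over a list of query-key scores."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def clipped_weights(scores, eps=1e-9):
    """Hypothetical variant: replace exp(x) with max(0, exp(x) - 1).

    eps is an assumption added here so the division stays defined
    when every score is <= 0 and all numerators are zero.
    Unlike softmax, this kernel is not shift-invariant, so the
    max-subtraction trick above cannot be reused.
    """
    nums = [max(0.0, math.expm1(s)) for s in scores]
    total = sum(nums) + eps
    return [n / total for n in nums]

# Scores for an aligned, an orthogonal, and an opposed key.
scores = [2.0, 0.0, -1.0]
print(softmax_weights(scores))  # every key gets some weight
print(clipped_weights(scores))  # only the aligned key gets weight
```

With softmax, the orthogonal and opposed keys still receive positive weight; with the clipped kernel their weight is exactly zero, which is the behavior the comment is after.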

1 comments

smallnamespace 616 days ago

> using max(0, exp(x)-1) instead of exp(x)

Won't this make the gradient vanish for x < 0 (the left half), causing problems with training?


espadrine 616 days ago

ReLU shares that concern. But since the weights are shared across the context/minibatch, some positions will likely still land on the positive side and pass gradient, so perhaps it would not be an issue in practice, as with ReLU.
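The gradient question can be checked numerically. The sketch below (plain Python, with hypothetical helper names) computes the derivative of max(0, exp(x)-1): it is exp(x) for x > 0 and 0 for x < 0, much like ReLU's step. The value chosen at x = 0 is a convention assumed here.

```python
import math

def kernel(x):
    """The proposed replacement for exp(x): max(0, exp(x) - 1)."""
    return max(0.0, math.expm1(x))

def kernel_grad(x):
    """Analytic derivative: exp(x) for x > 0, else 0.

    At x = 0 the function has a kink; returning 0 there is a
    convention assumed for this sketch, as is common for ReLU.
    """
    return math.exp(x) if x > 0 else 0.0

def numeric_grad(f, x, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-1.0, 1.0):
    print(x, kernel_grad(x), numeric_grad(kernel, x))
```

A single negative score gets no gradient, but a parameter shared across many positions receives the sum of per-position gradients, so it still updates as long as some scores are positive.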
