by blackbear_ 618 days ago
With a single softmax you cannot predict exactly 0, but only very small numbers. When you have a large number of values to add up, this "poisons" the output with a lot of irrelevant stuff (the noise mentioned in the paper).

To make things worse, low attention values will have very low gradient, thus needing a lot of weight updates to undo that kind of mistake. On the other hand, subtracting the outputs of two softmaxes allows the model to predict a weight of exactly zero for some of the values, while keeping a reasonable gradient flowing through.

So the model already knows what is noise, but a single softmax makes it harder to exclude it.
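A minimal sketch of the point above, in plain Python. The logits and the value of `lam` are made up for illustration and are not taken from the paper; the only claim is that single-softmax weights are all strictly positive, while the difference of two softmaxes can take either sign, so zero lies inside its reachable range.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one query over four keys (illustrative only).
a1 = [3.0, 0.5, 0.0, 0.0]   # first attention map: mostly on key 0
a2 = [0.0, 0.5, 1.0, 1.0]   # second attention map: mostly on the "noise" keys
lam = 0.5                    # assumed fixed here; learnable in the paper

single = softmax(a1)
diff = [p - lam * q for p, q in zip(single, softmax(a2))]

print(single)  # every entry is strictly greater than 0
print(diff)    # mixed signs, so a weight can pass through 0
```

Note that the differential weights sum to 1 - lam rather than 1, which is one reason the result is no longer a convex combination of the values.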

Moreover, with a single softmax the output of all heads is forced to stay in the convex hull of the value vectors, whereas with this variant each head can choose its own lambda, thus shifting the "range" of the outputs outside the convex hull pre-determined by the values. This makes the model as a whole more expressive.


While I don't discount the value of this, can you expand on the meaning of your claim that it makes the model 'more expressive'?

Everything I am seeing in this paper is related to reduced size and noise, which implies a reduction in expressiveness.

The improvements on needle-in-a-haystack benchmarks, on multi-hop questions over in-corpus data, and on multi-shot in-context learning point to this.

This is a wonderful thing if robustness is more important than generality, but it doesn't address trimming away activations that may be spurious in the general use case but may improve specificity in an individual domain.

Context would dramatically impact which tradeoffs are more desirable, and noise is probably never desirable. But the ability of this paper to enable smaller bit widths for inference points to a reduction in expressiveness.

Perhaps I am too focused on generalization?

What I meant is that by changing lambda each attention head is able to put its outputs in a subspace that is different from that of the other heads. This means that the outputs of different heads do not mingle with each other, and it's easier for the following layer to pick them apart. So I was thinking of increased expressiveness because the attention output can in principle cover a larger volume.

Maybe expressiveness is not the right term, or not the main consequence. I could imagine that having different subspaces like that also introduces a degree of robustness to out-of-distribution inputs, as this would make it harder for the outputs of one attention head to shift towards the in-distribution outputs of another head, and thus for the following layer to confuse them.

I'm able to follow most of what you're saying. It's unclear to me what "convex hull" means though.

Also, where is each softmax happening here? For each attention head?

> It's unclear to me what "convex hull" means though.

The convex hull (https://en.wikipedia.org/wiki/Convex_hull) of a set is the smallest convex shape that includes that set. Geometrically, it's what you'd get if you "shrink wrapped" the thing you're looking at: edges still protrude, but any indentations get smoothed over.

In this context, the grandparent comment is pointing out that with a traditional transformer block, the resulting computed value for a token can never "stick out" past some weighted average of the values of attended-to tokens, but this differential attention formalism allows that result.

The softmax value y is a linear combination of the vectors you're attending over: y = a_1 v_1 + a_2 v_2 + ... + a_n v_n where a_i >= 0 and sum(a_i) = 1.

Then y is a convex combination of the v_i, and sits in the convex hull of the v_i.
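To make that concrete, here is a small sketch with scalar "values", where the convex hull is just the interval [min, max]. The weights `s1`, `s2` and the scalar `lam` are hand-picked, hypothetical numbers (not from the paper): a convex combination cannot leave the interval, while the differential weights s1 - lam * s2 can.

```python
# Scalar stand-ins for the value vectors; their convex hull is [-1.0, 1.0].
values = [-1.0, 0.0, 1.0]

# Two hand-picked attention distributions (each is non-negative and sums to 1).
s1 = [0.1, 0.1, 0.8]
s2 = [0.8, 0.1, 0.1]
lam = 0.8  # hypothetical lambda, chosen only for illustration

def weighted_sum(weights, vals):
    return sum(w * v for w, v in zip(weights, vals))

# Standard attention: a convex combination, stays inside [-1, 1].
y_single = weighted_sum(s1, values)

# Differential weights: may be negative and no longer sum to 1.
diff_weights = [p - lam * q for p, q in zip(s1, s2)]
y_diff = weighted_sum(diff_weights, values)

print(y_single)  # inside the hull
print(y_diff)    # about 1.26, outside the hull
```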

The convex hull of a set of points is the region "between" those points. So the convex hull of three points (that do not lie on the same line) is a triangle with those three points as vertices. If you add a fourth point inside the triangle, the convex hull remains the same, but if you add it outside then the convex hull becomes the four-sided region with those points as vertices.

In the context of standard transformer attention, each output lies in the convex hull of ("somewhere between") the input values. With the modification of this paper, the input values can be scaled a little so that the output of different heads can be in different "regions" and thus do not interfere with each other (so yes to your third question, the two softmaxes are performed separately for each head).

O_i = softmax(...) * V_i and softmax is between 0 and 1, so O_i = alpha * V_i for some alpha between 0 and 1 so that makes it convex, and it makes the O_i just a shrunken version of V_i. Whereas if you have the diff of softmaxes, you get O_i = (alpha - beta) * V_i, which can range from -V_i to +V_i, so its output could rescale /or/ flip V_i. And yes this is happening in every head in parallel, then they get summed.
By simply inputting your comment into 4o, with no other context about the paper, I was able to get a pretty good analysis of the dual-head concept's implications.

https://chatgpt.com/share/67058973-ba94-8008-bed7-c7f9d08dc5...

Uh, this is extracting a LOT from very little data. I don't understand where it's coming from but its explanation just keeps going into more and more detail ... that doesn't seem to follow from the data it's got.

I just don't see how you could answer these questions without trying it out. And ChatGPT DEFINITELY isn't doing that.

Plus the obvious question I'd pose is not in there. What's the difference in performance between this trick and just "(softmax() - 0.5) * 2"? That seems very relevant.

Could you help explain how we would achieve an attention score of exactly 0, in practice? Here’s my take:

If we’re subtracting one attention matrix from another, we’d end up with attention scores between -1 and 1, with a probability of effectively 0 for any single entry to exactly equal 0.

What’s more, the learnable parameter \lambda allows for negative values. This would allow the model to learn to actually add the attention scores, making a score of exactly 0 impossible.
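A quick numeric check of this reasoning, as a sketch with made-up logits (nothing here comes from the paper's code). For a given entry i, the difference s1[i] - lam * s2[i] vanishes only at lam = s1[i] / s2[i], which is always positive because both softmax outputs are positive. With a negative lambda the two maps are added, so every weight stays strictly positive.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two attention maps (illustrative only).
s1 = softmax([3.0, 0.5, 0.0, 0.0])
s2 = softmax([0.0, 0.5, 1.0, 1.0])

# The only lambda that cancels entry i is a ratio of two positive numbers.
i = 2
lam_star = s1[i] / s2[i]
residual = s1[i] - lam_star * s2[i]   # zero up to floating-point rounding

# A negative lambda turns the subtraction into an addition.
lam_neg = -0.5
added = [p - lam_neg * q for p, q in zip(s1, s2)]

print(lam_star, residual)
print(added)  # all entries strictly positive
```

So cancelling one specific entry needs a specific positive lambda, and a single scalar lambda per head cannot hit that ratio for every noise entry at once; in practice the weights would end up near zero rather than exactly zero.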

Your comment brings up two variants that could be interesting if your goal is to increase the sparsity of the attention:

- Rectify the difference of the softmaxes. (max(0, s(A1) - lambda s(A2)))

- Apply the Heaviside function to the second softmax. (s(A1) - lambda H(s(A1) - lambda s(A2)))

The second one being a bit more drastic and maybe harder to train.
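A rough sketch of the first variant only, reading "rectify" as a ReLU, i.e. max(0, ...). The logits and `lam` are hypothetical, and this is not something the paper proposes; it just shows that rectifying the difference yields weights that are exactly zero wherever the second map outweighs the first.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits (illustrative only).
s1 = softmax([3.0, 0.5, 0.0, 0.0])
s2 = softmax([0.0, 0.5, 1.0, 1.0])
lam = 0.5

# Rectified differential weights: negative differences are clamped to 0.0.
rectified = [max(0.0, p - lam * q) for p, q in zip(s1, s2)]

print(rectified)             # only the first key keeps a non-zero weight
print(rectified.count(0.0))  # number of exactly-zero weights
```

The clamped entries also receive zero gradient, which is the usual ReLU tradeoff.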

It is a neat approach, but one that comes with a tradeoff, IIUC: doubling the key heads.

I wonder if a different approach without that issue exists. For instance, using max(0, exp(x)-1) instead of exp(x) in the softmax attention formula. That way when the query is orthogonal to the key (or worse), it does not contribute.
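For illustration, a sketch of what that replacement could look like for one query. The function name `shifted_exp_attention` and the fallback for an all-zero kernel are assumptions made for this example, not something from the paper or the comment above.

```python
import math

def shifted_exp_attention(scores):
    # Kernel max(0, exp(x) - 1): exactly zero whenever the score x <= 0.
    kernel = [max(0.0, math.exp(x) - 1.0) for x in scores]
    total = sum(kernel)
    if total == 0.0:
        # Assumption: if no key has a positive score, attend to nothing.
        return [0.0] * len(scores)
    return [k / total for k in kernel]

# Hypothetical query-key scores; 0.0 stands for an orthogonal key.
weights = shifted_exp_attention([2.0, 0.0, -1.5, 0.5])
print(weights)  # keys with score <= 0 get weight exactly 0.0

nothing = shifted_exp_attention([-1.0, -0.2, 0.0])
print(nothing)  # all zeros under the fallback above
```

One design question this exposes is the all-non-positive case: ordinary softmax always has a positive denominator, whereas this kernel needs an explicit rule when every score is at or below zero.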

> using max(0, exp(x)-1) instead of exp(x)

Won't this cause the gradient to vanish on the left half, causing problems with training?

That is a concern that is shared with ReLU. But since the weights are shared across the context/minibatch, perhaps that would not be an issue, similar to ReLU.
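The comparison with ReLU can be checked numerically. This is a small sketch using a central finite difference; the helper names are made up for the example. Both kernels have zero gradient for negative inputs, and the shifted-exp kernel has gradient exp(x) for positive inputs.

```python
import math

def shifted_exp(x):
    return max(0.0, math.exp(x) - 1.0)

def relu(x):
    return max(0.0, x)

def numeric_grad(f, x, h=1e-6):
    # Central finite difference; adequate away from the kink at x = 0.
    return (f(x + h) - f(x - h)) / (2 * h)

g_neg = numeric_grad(shifted_exp, -2.0)   # flat region, gradient is 0.0
g_pos = numeric_grad(shifted_exp, 1.0)    # close to exp(1.0)
r_neg = numeric_grad(relu, -2.0)          # ReLU is flat here too

print(g_neg, g_pos, r_neg)
```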
> predict a weight of exactly zero for some of the values

Wouldn’t this be pretty unlikely, though?

Quite the opposite — if you have a long sequence only a smattering of the words will influence the meaning of the current word. Everything else is “noise”.

Attention is really good at finding this smattering of words (ie assign most weight there). But it struggles to put exactly 0 on the other words.

I mean wouldn’t it be unlikely that

SoftmaxA[n] - SoftmaxB[n] is exactly 0?

Even if 2 attention layers learn two different things, I would imagine the corresponding weights in each layer wouldn’t exactly cancel each other out.

why say lot word when few word do
Few word no do tho
🫥
Phew!
is this the same reason why fp4 with more parameters beats fp16 with less?