Hacker News new | ask | show | jobs
by msoad 617 days ago
Like most things in this new world of Machine Learning, I'm really confused why this works?

The analogy to noise-cancelling headphones is helpful but in that case we clearly know which is signal and which is noise. Here, if we knew why would we even bother to the noise-cancelling work?

9 comments

With a single softmax you cannot predict exactly 0, but only very small numbers. When you have a large number of values to add up, this "poisons" the output with a lot of irrelevant stuff (the noise mentioned in the paper).

To make things worse, low attention values will have very low gradient, thus needing a lot of weight updates to undo that kind of mistakes. On the other hand, subtracting the output of two softmax allows the model to predict a weight of exactly zero for some of the values, while keeping a reasonable gradient flowing through.

So the model already knows what is noise, but a single softmax makes it harder to exclude it.

Moreover, with a single softmax the output of all heads is forced to stay in the convex hull of the value vectors, whereas with this variant each head can choose its own lambda, thus shifting the "range" of the outputs outside the convex hull pre-determined by the values. This makes the model as a whole more expressive.

While I don't discount the value of this, can you expand on the meaning of your claim that it makes the model 'more expressive'

Everything I am seeing in this paper is related to reduced size and noise, which implies a reduction in expressiveness.

The improvement in needle and a haystack, benchmarks on multi-hop questions of in corpus data and multishot in-context learning points to this.

This is a wonderful thing if robustness is more important than generality, but it doesn't address trimming away activations that may be spurious in the general use case but may improve an individual domain specificity.

Context would dramatically impact what tradeoffs and more desireble, and noise is probably never desirable. But the ability of this paper to enable bit size for inference points to a reduction in expressiveness.

Perhaps I am too focused on generalization?

What I meant is that by changing lambda each attention head is able to put its outputs in a subspace that is different than that of the other heads. This means that the outputs of different heads do not mingle with each other, and it's easier for the following layer to pick them apart. So I was thinking at increased expressiveness because the attention output can in principle cover a larger volume.

Maybe expressiveness is not the right term, or not the main consequence. I could imagine that having different subspaces like that also introduces a degree of robustness to out-of-distribution inputs, as this would make it harder for the outputs of one attention head to shift towards the in-distribution outputs of another head, and thus for the following layer to confuse them.

I'm able to follow most of what you're saying. It's unclear to me what "convex hull" means though.

Also, where is each softmax happening here? For each attention head?

> It's unclear to me what "convex hull" means though.

The convex hull (https://en.wikipedia.org/wiki/Convex_hull) of a set is the smallest convex shape that includes that set. Geometrically, it's what you'd get if you "shrink wrapped" the thing you're looking at: edges still protrude, but any indentations get smoothed over.

In this context, the grandparent comment is pointing out that with a traditional transformer block, the resulting computed value for a token can never "stick out" past some weighted average of the values of attended-to tokens, but this differential attention formalism allows that result.

The softmax value y is a linear combination of the vectors you're attending over: y = a1v1 + a2v2 + ... + an*vn where a_i >= 0 and sum(a_i) = 1.

Then y is a convex combination of the v_i, and sits in the convex hull of the v_i.

The convex hull of a set of points is the region "between" those points. So the convex hull of three points (that do not lie on the same line) is a triangle with those three points as vertices. If you add a fourth point inside the triangle, the convex hull remains the same, but if you add it outside then the convex hull becomes the four-sided region with those points as vertices.

In the context of standard transformer attention, each output lies in the convex hull ("somewhere between") the input values. With the modification of this paper, the input values can be scaled a little so that the output of different heads can be in different "regions" and thus do not interfere with each other (so yes to your third question, the two softmaxes are performed separately for each head).

O_i = softmax(...) * V_i and softmax is between 0 and 1, so O_i = alpha * V_i for some alpha between 0 and 1 so that makes it convex, and it makes the O_i just a shrunken version of V_i. Whereas if you have the diff of softmaxes, you get O_i = (alpha - beta) * V_i, which can range from -V_i to +V_i, so its output could rescale /or/ flip V_i. And yes this is happening in every head in parallel, then they get summed.
By simply inputting your comment in to 4o, with no other context about the paper, I was able to get a pretty good analysis of the dual-head concept's implications.

https://chatgpt.com/share/67058973-ba94-8008-bed7-c7f9d08dc5...

Uh, this is extracting a LOT from very little data. I don't understand where it's coming from but it's explanation just keeps going into more and more detail ... that doesn't seem to follow from the data it's got.

I just don't see how you could answer these questions without trying it out. And chatgtp DEFINITELY isn't doing that.

Plus the obvious question I'd pose is not in there. What's the difference in performance between this trick and just "softmax() - 0.5 * 2" ? That seems very relevant.

Could you help explain how we would achieve an attention score of exactly 0, in practice? Here’s my take:

If we’re subtracting one attention matrix from another, we’d end up with attention scores between -1 and 1, with a probability of effectively 0 for any single entry to exactly equal 0.

What’s more, the learnable parameter \lambda allows for negative values. This would allow the model to learn to actually add the attention scores, making a score of exactly 0 impossible.

Your comment brings up two interesting variants that could be interesting if your goal is to increase the sparsity of the attention:

- Rectify the difference of the softmaxes. (min(0, s(A1) - lambda s(A2)))

- Apply the Heaviside function to the second softmax. (softmax(A1) - lambda H(s(A1) - lambda s(A2))

The second one being a bit more drastic and maybe harder to train.

It is a neat approach, but one that comes with a tradeoff, IIUC: doubling the key heads.

I wonder if a different approach without that issue exists. For instance, using max(0, exp(x)-1) instead of exp(x) in the softmax attention formula. That way when the query is orthogonal to the key (or worse), it does not contribute.

> using max(0, exp(x)-1) instead of exp(x)

Won't this cause the gradient to vanish on the left half, causing problems with training?

That is a concern that is shared with ReLU. But since the weights are shared across the context/minibatch, perhaps that would not be an issue, similar to ReLU.
> predict a weight of exactly zero for some of the values

Wouldn’t this be pretty unlikely, though?

Quite the opposite — if you have a long sequence only a smattering of the words will influence the meaning of the current word. Everything else is “noise”.

Attention is really good at finding this smattering of words (ie assign most weight there). But it struggles to put exactly 0 on the other words.

I mean wouldn’t it be unlikely that

SoftmaxA[n] - SoftmaxB[n] is exactly 0?

Even if 2 attention layers learn two different things, I would imagine the corresponding weights in each layer wouldn’t exactly cancel each other out.

why say lot word when few word do
Few word no do tho
U+1FAE5
Phew!
is this the same reason why fp4 with more parameters beats fp16 with less?
Noise cancelling headphones are probably the wrong analogy here.

The better example is the differential signalling used in professional audio and many digital signaling protocols like Ethernet, HDMI and USB.

Instead of using one wire, referencing to ground, they send the signal as the difference between both wires. Both wires end up carrying the same signal with inverted polarity. Because both wires are running next to each-other any external noice will be applied to both equally.

The voltage will change, but the difference in voltage between both wires is untouched. And when you subtract the two voltages at the receiver end, any noise simply gets subtracted out.

I think when they bring up differential amplifiers they're referring more to the DSP technique of how headphone noise cancelling works but the actual electrical properties of how a differential amplifier does that muddies the message a bit.

It sort of feels closer to heterodyning and "demodulating" the signal encoded in the softmax. Those tiny little errors we're trying to denoise with this technique are almost closer to carrier waves (when encoded to softmax) than noise imo. This wouldn't get rid of noise in the training data or noise in the dimensionality of the key / value space. It's really only removing noise introduced by the process itself.

This is a cool idea!
Don't look for an analogy, this just adds a new mathematical capability. It enables "negative attention", the network can say "I want to subtract the contribution of this token" in the attention calculation. Previously it could only reduce how much it adds.

The simple way of doing this would be to just remove the softmax or use a sigmoid instead, but in practice a softmax works better it seems.

My hypothesis for why this works that it mitigates the downsides of rope

to eli5:

rope is the modern strategy used to give information to the model about how far a query and a key are apart when doing attention. It's the best strategy we have now, but has a major downside, where it makes some connections between tokens that are far apart much stronger than you would like them to be. Xpos (https://arxiv.org/pdf/2212.10554) is another paper by microsoft tackling issues with rope and you can see figure 1 on page 4 to get a visual interpretation of the sinusoidal attention strength (you would like it to be smooth).

I think a big reason differential transformers is working so well, especially on long sequence stuff, because when both q1 and q2 don't match a token, the rope relative strength will still have the same value and the noise will cancel out. Leaving intended matches, but at the cost of somewhat dampening the original value rope brought.

Just a hypothesis though. It would be easy to test by running this experiment against a baseline where both use alibi attention (https://arxiv.org/pdf/2108.12409) which has a different set of tradeoffs this wouldn't mitigate, but still a really interesting result.

Some of the "prior art" here is ladder networks and to some handwavy extent residual nets, both of which can be interpreted as training the model on reducing the error to its previous predictions as opposed to predicting the final result directly. I think some intuition for why it works has to do with changing the gradient descent landscape to be a bit friendlier towards learning in small baby steps, as you are now explicitly designing the network around the idea that it will start off making lots of errors in its predictions and then get better over time.
It sounds like they're just splitting the query / key space down the middle. We don't know which dimensions are encoded in each matrix, but they're assuming the "noise" introduced in one query / key space is equivalent to noise introduced in the other space.

If that is the case, then the "signal" in this case would be the softmax that encodes the dimensions captured by the query / key space. Since the noise ideally is the same in both softmax encodings, subtracting them should "cancel out" the noise.

I think common mode filtering in balanced audio cables is a much better analogy than noise canceling headphones (and where this paper gets its name from I assume), you don't know what the noise is ahead of time, but if you take two samples with one positive and one negative, noise displaces both absolutely, which you can take advantage of to denoise the signal (find the differential mode).

For example, if you are trying to send a +1V signal on one wire, and a -1V signal on the other and a +0.5V noise exists, one wire will have +1.5V and the other will have -0.5V,

Take the difference and divide by 2:

(+1.5V - -0.5V) / 2 = +1V or, if your setup is different (-0.5V - +1.5V) / 2 = -1V

I don't understand either. It seems the general idea is that they calculate attention twice, which due to random initialization might be expected to give two slightly different results. I'd have thought that what these two attention maps would have in common would be the signal, and where they would differ would be noise, so rather than subtracting them (resulting in all noise?!) what you really want is to add (so the common signal gets reinforced) and normalize.
I think there might be some communalities with system engineering, where you subtract the output from the input in order to get a control signal that steers the plant to the target values. I too fail to see how that would be supposed to work in practice.
The values between the groups are also going to diverge during training due to the structure of the DiffAttn equation.

The analogy I can think of is when you're paying attention to a variety of things and you actively avoid concentrating on something because it will distract you. You don't give it zero attention, you give it negative attention.

the model is supposed to learn this