Hacker News new | ask | show | jobs
by pizza 613 days ago
O_i = softmax(...) * V_i and softmax is between 0 and 1, so O_i = alpha * V_i for some alpha between 0 and 1 so that makes it convex, and it makes the O_i just a shrunken version of V_i. Whereas if you have the diff of softmaxes, you get O_i = (alpha - beta) * V_i, which can range from -V_i to +V_i, so its output could rescale /or/ flip V_i. And yes this is happening in every head in parallel, then they get summed.
1 comments

By simply inputting your comment in to 4o, with no other context about the paper, I was able to get a pretty good analysis of the dual-head concept's implications.

https://chatgpt.com/share/67058973-ba94-8008-bed7-c7f9d08dc5...

Uh, this is extracting a LOT from very little data. I don't understand where it's coming from but it's explanation just keeps going into more and more detail ... that doesn't seem to follow from the data it's got.

I just don't see how you could answer these questions without trying it out. And chatgtp DEFINITELY isn't doing that.

Plus the obvious question I'd pose is not in there. What's the difference in performance between this trick and just "softmax() - 0.5 * 2" ? That seems very relevant.