Hacker News
by Majromax 613 days ago
> It's unclear to me what "convex hull" means though.

The convex hull (https://en.wikipedia.org/wiki/Convex_hull) of a set is the smallest convex shape that includes that set. Geometrically, it's what you'd get if you "shrink wrapped" the thing you're looking at: edges still protrude, but any indentations get smoothed over.

In this context, the grandparent comment is pointing out that with a traditional transformer block, the computed value for a token is a weighted average of the values of the attended-to tokens, so it can never "stick out" past their convex hull. The differential attention formalism allows the result to do exactly that.

1 comments

The softmax value y is a linear combination of the vectors you're attending over: y = a_1 v_1 + a_2 v_2 + ... + a_n v_n, where a_i >= 0 and sum(a_i) = 1.

Then y is a convex combination of the v_i, and sits in the convex hull of the v_i.
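
To make that concrete, here is a minimal sketch in plain Python. It uses scalar values, so the convex hull is just the interval [min(v), max(v)]. The differential weights are assumed to take the form a - lam * b, the difference of two softmax maps; that form, the scores, and lam = 0.5 are illustrative assumptions, not something stated in this thread.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, values):
    return sum(w * v for w, v in zip(weights, values))

# Scalar "value vectors": their convex hull is the interval [-1, 1].
values = [-1.0, 0.0, 1.0]

# Ordinary softmax attention: weights are >= 0 and sum to 1.
a = softmax([0.0, 0.0, 10.0])
y_soft = weighted_sum(a, values)

# Assumed differential form: subtract a scaled second softmax map.
# Individual weights can now be negative, so the result can leave the hull.
b = softmax([10.0, 0.0, 0.0])
lam = 0.5  # hypothetical scaling factor
diff_weights = [ai - lam * bi for ai, bi in zip(a, b)]
y_diff = weighted_sum(diff_weights, values)

print("softmax output:", y_soft)
print("differential output:", y_diff)
```

With these numbers the softmax output stays inside [-1, 1], while the differential output lands above 1 because the weight on the value -1 has gone negative.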