by Majromax
613 days ago
> It's unclear to me what "convex hull" means though.

The convex hull (https://en.wikipedia.org/wiki/Convex_hull) of a set is the smallest convex shape that includes that set. Geometrically, it's what you'd get if you "shrink wrapped" the thing you're looking at: corners and edges still protrude, but any indentations get smoothed over.

In this context, the grandparent comment is pointing out that with a traditional transformer block, the softmax attention weights are non-negative and sum to one, so the computed value for a token is a weighted average of the values of the attended-to tokens and can never "stick out" past their convex hull. The differential attention formalism subtracts one attention map from another, so the effective weights can be negative and the result can land outside that hull.
Then y is a convex combination of the v_i, and sits in the convex hull of the v_i.
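
To make that concrete, here is a minimal sketch in plain Python. It uses scalar "values" so the convex hull is just an interval. The scores, the `lam` scale factor, and the "difference of two softmaxes" form of differential attention are my own simplifying assumptions for illustration (any later normalization is ignored), not the exact formulation from the paper.

```python
import math

def softmax(scores):
    # Numerically stable softmax: outputs are non-negative and sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, values):
    return sum(w * v for w, v in zip(weights, values))

# Two scalar values v_i; their convex hull is the interval [0, 1].
values = [0.0, 1.0]

# Standard attention: y is a convex combination of the v_i.
w = softmax([2.0, 0.0])
y = weighted_sum(w, values)

# Differential-style attention (assumed form): one softmax map minus
# lam times another. lam and both score vectors are hypothetical.
lam = 0.8
w1 = softmax([2.0, 0.0])
w2 = softmax([0.0, 2.0])
w_diff = [a - lam * b for a, b in zip(w1, w2)]
y_diff = weighted_sum(w_diff, values)

print("softmax weights:", w, "-> y =", y)
print("differential weights:", w_diff, "-> y =", y_diff)
```

With these inputs the second differential weight comes out negative, so `y_diff` falls below 0 and leaves the interval [0, 1], while `y` from plain softmax stays inside it. Note the differential weights also no longer sum to one, so the result is not even an affine combination of the values.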