| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Vetch 591 days ago

Note that this is an expressibility (upper) bound on transformers granted intermediate decoding steps. It says nothing about their learnability, and modern LLMs are not near that level of expressive capacity.

The authors also introduce projected pre-norm and layer-norm hash to facilitate their proofs, another sense in which it is an upper-bound on the current approach to AI, since these concepts are not standard. Nonetheless, the paper shows how allowing a number of intermediate decoding steps polynomial in input size is already enough to run most programs of interest (which are in P).

There are additional issues. This work relies on the concept of saturated attention, however as context length grows in real world transformers, self-attention deviates from this model as it becomes noisier, with unimportant indices getting undue focus (IIUC, due to precision issues and how softmax assigns non-zero probability to every token). Finally, it's worth noting that the more under-specified your problem is, and the more complex the problem representation is, then the quickly more intractable the induced probabilistic inference problem. Unless you're explicitly (and wastefully) programming a simulated turing machine through the LLM, this will be far from real-time interactive. Users should expect a prolog like experience of spending most of their time working out how to help search.

Trivia: Softmax also introduces another problem: the way softmax is applied forces attention to always assign importance to some tokens, often leading to dumping of focus on typically semantically unimportant tokens like whitespace. This can lead to an overemphasis on unimportant tokens, possibly inducing spurious correlations on whitespace, this propagating through the network with possibly unexpected negative downstream effects around whitespace.