An LLM with chain of thought and unbounded compute/context can run any program in PTIME: https://arxiv.org/abs/2310.07923 , which is a huge class of programs.
Note that this is an expressibility (upper) bound on transformers granted intermediate decoding steps. It says nothing about their learnability, and modern LLMs are not near that level of expressive capacity.
The authors also introduce projected pre-norm and layer-norm hash to facilitate their proofs, another sense in which it is an upper-bound on the current approach to AI, since these concepts are not standard. Nonetheless, the paper shows how allowing a number of intermediate decoding steps polynomial in input size is already enough to run most programs of interest (which are in P).
There are additional issues. This work relies on the concept of saturated attention, however as context length grows in real world transformers, self-attention deviates from this model as it becomes noisier, with unimportant indices getting undue focus (IIUC, due to precision issues and how softmax assigns non-zero probability to every token). Finally, it's worth noting that the more under-specified your problem is, and the more complex the problem representation is, then the quickly more intractable the induced probabilistic inference problem. Unless you're explicitly (and wastefully) programming a simulated turing machine through the LLM, this will be far from real-time interactive. Users should expect a prolog like experience of spending most of their time working out how to help search.
Trivia: Softmax also introduces another problem: the way softmax is applied forces attention to always assign importance to some tokens, often leading to dumping of focus on typically semantically unimportant tokens like whitespace. This can lead to an overemphasis on unimportant tokens, possibly inducing spurious correlations on whitespace, this propagating through the network with possibly unexpected negative downstream effects around whitespace.
No, it's like saying "a computer can't crack SHA hashes, it can only add and subtract numbers together" "a computer can crack any SHA hash" "yes, given infinite time".
The fact that you need infinite time for some of the stuff doesn't mean you can't do any of the stuff.
If it's executing a program, then the easiest way to make it more efficient, is to ditch the LLM and just execute the program. The LLM in this case is basically only (very very very very very inefficiently) approximating the very CPU it's running on. Just use the CPU to execute the program! And you won't be running it on an approximated processor, you'll be running it on a deterministic, reliable one that will not give the wrong answer ever (given the right program and correct input, of course, and assuming no hardware failures, which would affect LLMs too)
They're a neural network like any other. They are universal function approximators. Doesn't mean that approximating a function that executes a particular program doesn't need an intractably large neural network.
The authors also introduce projected pre-norm and layer-norm hash to facilitate their proofs, another sense in which it is an upper-bound on the current approach to AI, since these concepts are not standard. Nonetheless, the paper shows how allowing a number of intermediate decoding steps polynomial in input size is already enough to run most programs of interest (which are in P).
There are additional issues. This work relies on the concept of saturated attention, however as context length grows in real world transformers, self-attention deviates from this model as it becomes noisier, with unimportant indices getting undue focus (IIUC, due to precision issues and how softmax assigns non-zero probability to every token). Finally, it's worth noting that the more under-specified your problem is, and the more complex the problem representation is, then the quickly more intractable the induced probabilistic inference problem. Unless you're explicitly (and wastefully) programming a simulated turing machine through the LLM, this will be far from real-time interactive. Users should expect a prolog like experience of spending most of their time working out how to help search.
Trivia: Softmax also introduces another problem: the way softmax is applied forces attention to always assign importance to some tokens, often leading to dumping of focus on typically semantically unimportant tokens like whitespace. This can lead to an overemphasis on unimportant tokens, possibly inducing spurious correlations on whitespace, this propagating through the network with possibly unexpected negative downstream effects around whitespace.