Hacker News new | ask | show | jobs
by piotr93 548 days ago
The only computations that an LLM does are backprops and forward passes. It can not run any arbitrary program description. Yes, it will hallucinate your program's output if you feed it some good enough starting prompt. But that's it.
2 comments

An LLM with chain of thought and unbounded compute/context can run any program in PTIME: https://arxiv.org/abs/2310.07923 , which is a huge class of programs.
Note that this is an expressibility (upper) bound on transformers granted intermediate decoding steps. It says nothing about their learnability, and modern LLMs are not near that level of expressive capacity.

The authors also introduce projected pre-norm and layer-norm hash to facilitate their proofs, another sense in which it is an upper-bound on the current approach to AI, since these concepts are not standard. Nonetheless, the paper shows how allowing a number of intermediate decoding steps polynomial in input size is already enough to run most programs of interest (which are in P).

There are additional issues. This work relies on the concept of saturated attention, however as context length grows in real world transformers, self-attention deviates from this model as it becomes noisier, with unimportant indices getting undue focus (IIUC, due to precision issues and how softmax assigns non-zero probability to every token). Finally, it's worth noting that the more under-specified your problem is, and the more complex the problem representation is, then the quickly more intractable the induced probabilistic inference problem. Unless you're explicitly (and wastefully) programming a simulated turing machine through the LLM, this will be far from real-time interactive. Users should expect a prolog like experience of spending most of their time working out how to help search.

Trivia: Softmax also introduces another problem: the way softmax is applied forces attention to always assign importance to some tokens, often leading to dumping of focus on typically semantically unimportant tokens like whitespace. This can lead to an overemphasis on unimportant tokens, possibly inducing spurious correlations on whitespace, this propagating through the network with possibly unexpected negative downstream effects around whitespace.

> "An LLM with unbounded compute/context"

This isn't a thing we have, or will have.

It's like saying that a computer with infinite memory, CPU and power can certainly break SHA-256 and bring the world's economy down with it.

No, it's like saying "a computer can't crack SHA hashes, it can only add and subtract numbers together" "a computer can crack any SHA hash" "yes, given infinite time".

The fact that you need infinite time for some of the stuff doesn't mean you can't do any of the stuff.

I mean it doesn't need to compute all programs in a human length reasonable amount of time.

It just needs to be able to compute enough programs to be useful.

Even our current infrastructure of precisely defined programs and compilers isn't able to compute all programs.

It seems reasonable in the future be able to give an LLM the python language specification, a python program, and it iteratively returns the answer.

If it's executing a program, then the easiest way to make it more efficient, is to ditch the LLM and just execute the program. The LLM in this case is basically only (very very very very very inefficiently) approximating the very CPU it's running on. Just use the CPU to execute the program! And you won't be running it on an approximated processor, you'll be running it on a deterministic, reliable one that will not give the wrong answer ever (given the right program and correct input, of course, and assuming no hardware failures, which would affect LLMs too)
They're a neural network like any other. They are universal function approximators. Doesn't mean that approximating a function that executes a particular program doesn't need an intractably large neural network.
Woah super interesting, I didn't know about this. Will def read it! Seems like I was wrong?
genuinely what's the point of this comment? are you allergic to cool stuff? honestly curious as to what you were trying to achieve by this comment.

nowhere in this post does the author say that it's ready with the current state of models, or he'd use a foundation model for this. why the hate?