| You seem to repeatedly insist that hidden computation is a distinction of any relevance whatsoever. First of all, your understanding of the architecture itself is mistaken. A transformer can iterate endlessly because each token it produces allows it a forward pass, and each of these tokens is postpended to its input in the next inference. That's the autoregressive in autoregressive transformer, and the entire reason why it was proposed for arbitrary seq2seq transduction. This means you get layers * tokens iterations, where tokens is up to two million, and is in practice unlimited due to the LLM being able to summarize and select from that. Parallelism is irrelevant, since the transformer is sequential in the output of tokens. A transformer can iterate endlessly, it simply has to output enough tokens. And no, the throughput isn't limited either, since each token gets translated into a high-dimensional internal representation, that in turn is influenced by each other token in the model input. Models can encode whatever they want not just by choosing a token, but by choosing an arbitrary pattern of tokens encoding arbitrary latent-space interactions. Secondly, internal thoughts are irrelevant, because something being "internal" is an arbitrary distinction without impact. If I trained an LLM to prepend and postpend <internal_thought> to some part of its output, and then simply didn't show that part, then the LLM wouldn't magically become human. This is something many models do even today, in fact. Similarly, if I were to take a human and modify their brain to only be able to iterate using pen and paper, or by speaking out loud, then I wouldn't magically make them into something non-human. And I would definitely not reduce their capacity for reasoning in any way whatsoever. There are people with aphantasia working in the arts, there are people without an internal monologue working as authors - how "internal" something is can be trivially changed with no influence on either the architecture or the capabilities of that architecture. Reasoning itself isn't some unified process, neither is it infinite iteration. It requires specific understanding about the domain being reasoned over, especially understanding of which transformation rules are applicable to produce desired states, where the judgement about which states are desirable has to be learned itself. LLMs can reason today, they're just not as good at it than humans are in some domains. |
One reason why just blathering on endlessly isn't the same as thinking deeply before answering, is that it's almost impossible to maintain long-term context/attention. Try it. "Think step by step" or other attempts to prompt the model into generating a longer reply that builds upon itself, will only get you so far because keeping a 1-dimensional context is no substitute for the thousands of connections we have in our brain between neurons, and the richness of context we're therefore able to maintain while thinking.
The reasoning weakness of LLMs isn't limited to "some domains" that they had less training data for - it's a fundamental architecturally-based limitation. This becomes obvious when you see the failure modes of simple problems like "how few trips does the farmer need to cross the river with his chicken & corn, etc" type problems. You don't need to morph the problem to require out-of-distribution knowledge to get it to fail - small changes to the problem statement can make the model state that crossing the river backwards and forwards multiple times without loading/unloading anything is the optimal way to cross the river.
But, hey, no need to believe me, some random internet dude. People like Demis Hassabis (CEO of DeepMind) acknowledge the weakness too.