| HN Mirror



	by dataangel 1120 days ago
	The author could have done far simpler tests to find GPT-4 has lots of trouble reasoning. Forget sorting, GPT4 has trouble counting. Repeat a letter N times and ask it how many there are. It breaks before you hit 20. Or try negating multiple times, since more than twice is rare in natural language, and again it will fall over.
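The two probes described above are easy to generate with a known ground truth, so a model's answers can be checked mechanically. A minimal sketch, assuming hypothetical helper names (`makeCountingProbe`, `makeNegationProbe`) and an invented prompt wording; it only builds the prompts and does not call any model:

```javascript
// Hypothetical helpers for illustration; nothing here talks to a real model.

// Counting probe: one letter repeated n times, plus the known answer.
function makeCountingProbe(letter, n) {
  const text = letter.repeat(n);
  return {
    prompt: `How many "${letter}" characters are in: ${text}`,
    expected: n,
  };
}

// Stacked-negation probe: "not" applied k times to a true statement.
// An even number of negations leaves it true; an odd number flips it.
function makeNegationProbe(k) {
  const prefix = "not ".repeat(k);
  return {
    prompt: `True or false: it is ${prefix}the case that 2 + 2 = 4.`,
    expected: k % 2 === 0,
  };
}

console.log(makeCountingProbe("a", 20).prompt);
console.log(makeNegationProbe(3).prompt);
```

Sweeping `n` and `k` upward and comparing replies against `expected` would show where accuracy starts to drop.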

8 comments

jbay808 1120 days ago

Author here. Happy to see this discussion. Absolutely, GPT-4 sometimes has trouble reasoning and doesn't reason perfectly. I'm impressed by its successes, but I agree it's not at the human level yet, and I would not make the claim that it is.

Counting is a task that transformers can do, per Weiss.[1] But it's not surprising that transformer networks in general have trouble counting characters -- the tokenizer replaces common sub-strings, so the number of characters will not in general be the number of tokens. The network might have little way of even knowing how many characters are in a given token if that information isn't encountered elsewhere in training.

[1]: https://arxiv.org/abs/2106.06981
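To make the tokenizer point concrete, here is a toy greedy longest-match tokenizer. It is only a sketch: the vocabulary is invented for illustration, and real tokenizers (BPE and friends) are built differently, but the effect on character counts is the same in spirit.

```javascript
// Toy greedy longest-match tokenizer. TOY_VOCAB is invented for
// illustration and is NOT the vocabulary of any real model.
const TOY_VOCAB = ["aaaa", "aa", "a", "the", "th", "e", "t", "h", " "];

function toyTokenize(text, vocab = TOY_VOCAB) {
  // Try longer entries first so common substrings collapse into one token.
  const byLength = [...vocab].sort((x, y) => y.length - x.length);
  const tokens = [];
  let i = 0;
  while (i < text.length) {
    const match = byLength.find((entry) => text.startsWith(entry, i));
    // Fall back to a single character when nothing in the vocabulary matches.
    const token = match ?? text[i];
    tokens.push(token);
    i += token.length;
  }
  return tokens;
}

const sample = "a".repeat(11);
const sampleTokens = toyTokenize(sample);
// 11 characters become 4 tokens: "aaaa", "aaaa", "aa", "a"
console.log(sample.length, sampleTokens.length, sampleTokens);
```

The model only ever sees the token sequence, so nothing in the input directly says that the first token stands for four characters.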

flandish 1120 days ago

It has trouble “reasoning” because that is a human phenomenon. These ML driven LLMs or “AI” systems are, truly, “word calculators.”

They will never achieve “reason” or understand what it means to do so; they are not human.

Sure, with enough input (in the form of LLM) it can predict what a human’s reasoning may look like, but philosophically, that’s a different thing.

Reason is not universal like how math is.

koonsolo 1120 days ago

This is such bullshit. Time and time again this theory has been disproven. "Animals cannot reason", and then oops, sometimes our human brain is holding us back and rats are smarter at the task (https://www.researchgate.net/publication/259652611_More_comp...)

"A computer will never beat a chess master", etc.

Here are the facts for you: our reasoning is done by our brain. Our brain is just a bunch of processes. Those processes can be replicated in a computer. The number of cells and the speed can be improved. And there you have it, a superior reasoning machine.

These "only humans can do X" claims mostly come from religion or other superiority bullshit, but in the end humans are not that special, although we seem to like to think so.

flandish 1119 days ago

Bullshit? Why so harsh? If our brain is a bunch of processes and chemicals - then does all of this matter in the end?

Superior? No.

Religion? No.

Philosophy? Yes.

TeMPOraL 1120 days ago

Huh. I'm all for human exceptionalism (until it stops being supported by observed evidence), but let's be specific on what makes humans special. Yes, we absolutely stand high above all other (known) life (on Earth) - but we do so in the same sense GPT-4 stands high above GPT-3.5 and every other LLM currently known to the public. In quantity, not quality.

Biologically, we're clearly an increment over the next smartest thing - we have the same kind of hardware, doing the same things, built by the same process. But that increment carried us through the threshold where our brains became powerful enough to break our species free of biological evolution, subjecting us to the much faster process of technological evolution. This is why chimpanzees live in zoos built by humans, and not the other way around.

If anything, biological history of humanity tells us LLMs may just as well be thinking and reasoning in the same sense we are. That's because evolution by natural selection is a dumb, greedy, local optimization process that cannot fixate anything that doesn't provide incremental benefits along the way. In other words, whatever makes our brains tick, it's something that must 1) start with simple structures, 2) be easy to just randomly stumble on, 3) scale far, and 4) be scalable along a path that delivers capability improvements at every step. Transformer models fit all four of the points.

> with enough input (in the form of LLM) it can predict what a human’s reasoning may look like, but philosophically, that’s a different thing

By what school of philosophy? The one I subscribe to (whatever its name) says it's absolutely the same thing. It's in agreement with science on this one.

flandish 1119 days ago

You mentioned biology early in your reply.

Computers are not biological. Therefore (imho) they will never obtain, and only replicate by trained example, the phenomenological human experiences.

TeMPOraL 1119 days ago

> Computers are not biological.

So what? Biology isn't magic, it's nanotech. A lot of very tiny machines. It obeys the same rules as everything else in the universe.

More than that, the theoretical foundation on which our computers are built is universal - it doesn't depend on any physical material. We've made analog computers using water flowing between buckets. We've made digital computers from pebbles falling down and bouncing around what's effectively a vertical labyrinth. We've made digital computers out of gears. We've made digital computers out of pencil, paper, and a human with lots of patience.

Hell, we can make light compute by shining it at a block of plastic with funny etching. We can really make anything compute, it's not a difficult task.

We're using electricity, silicon wafers and nanoscale lithography because it's a process that's been best for us in practice, not because it's somehow special. We can absolutely make a digital computer out of anything biological. Grow cells into structures implementing logic gates for chemical signals? Easy. Make a CPU by wiring literal neurons and nerve cells extracted from some poor animal? Sure, it can be done. At which point, would you say, does such a computing tapestry made of living things gain the capability of "phenomenological experiences"? Or, conversely, what makes human brains fundamentally different from a computer we'd build out of nerve cells?

Vanit 1120 days ago

The author wrote a thoughtful article attempting to break this down and has humbly popped into the comments to discuss it... Is your direct response really just to cross your arms and say nope? Like, really?

flandish 1119 days ago

Using the metric “can reason” on an LLM is like using the metric “can bleed” on a stone.

Maybe some red stuff comes out when you break it. Is it blood, or is it something pumped into the other side?

xyzzy123 1120 days ago

If someone can show GPT-4 is "reasoning" (for some meaningful definition of that) in specific scenarios, surely counter-examples do not disprove this.

chaxor 1120 days ago

There are substantial works already showing reasoning capabilities in GPT-4, which show that these models do reason extremely well - near human performance for many causal reasoning tasks. (1) Additionally, there is a mathematical proof that these systems align with dynamic programming, and therefore can do algorithmic reasoning. (2,3)

1) https://arxiv.org/abs/2305.00050.pdf 2) https://arxiv.org/pdf/1905.13211.pdf 3) https://arxiv.org/pdf/2203.15544.pdf
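For readers unfamiliar with the dynamic-programming framing: the kind of algorithm meant is a repeated local update such as Bellman-Ford, where each node recomputes its value from its neighbors' values, which has the same shape as one round of message passing in a graph network. A minimal sketch with an invented example graph; it illustrates the algorithm class only and is not code from the linked papers:

```javascript
// Bellman-Ford shortest paths written as rounds of "message passing":
// in each round every edge sends a candidate distance to its target node,
// and each node keeps the minimum it has seen so far.
// Assumes no negative-weight cycles (no detection is done here).
function bellmanFord(nodeCount, edges, source) {
  let dist = new Array(nodeCount).fill(Infinity);
  dist[source] = 0;
  for (let round = 0; round < nodeCount - 1; round++) {
    const next = dist.slice();
    for (const [from, to, weight] of edges) {
      next[to] = Math.min(next[to], dist[from] + weight);
    }
    dist = next;
  }
  return dist;
}

// Invented 4-node example graph: [from, to, weight].
const exampleEdges = [[0, 1, 4], [0, 2, 1], [2, 1, 2], [1, 3, 5]];
console.log(bellmanFord(4, exampleEdges, 0)); // [ 0, 3, 1, 8 ]
```

Each round uses only the previous round's values, which is what makes it look like a layer-by-layer network update.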

pas 1119 days ago

is GPT4 a graph neural network? also, isn't it training time and data dependent how big (how many tokens) a problem it can tackle?

so it's great that it can reason better than humans on small-medium problems already well trained for, but so far Transformers are not reasoning (not doing causal graph analysis, or not even doing zero order logic), they are eerily good at writing text that has the right keywords. and of course it's very powerful and probably will be useful for many applications.

chaxor 1119 days ago

They are GNNs with attention as the message passing function and additional concatenated positional embeddings. As for reasoning, these are not quite 'problems well-trained for', in the sense that they're not in the training data. But they are likely problems that have some abstract algorithmic similarity, which is the point.
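The "attention as message passing" view can be sketched in a few lines: treat every token as a node in a fully connected graph, and let each node aggregate a softmax-weighted sum of all nodes' values. This is a simplified illustration under stated assumptions: a single head, no learned query/key/value projections, and no positional embeddings, so it is not how any particular model is implemented.

```javascript
// Single-head self-attention viewed as message passing on a fully
// connected graph. Queries, keys and values are the raw embeddings
// themselves (no learned projections), which is a simplification.

function dot(u, v) {
  return u.reduce((sum, x, i) => sum + x * v[i], 0);
}

function softmax(scores) {
  const max = Math.max(...scores); // subtract the max for numerical stability
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

function attentionMessagePassing(embeddings) {
  const scale = Math.sqrt(embeddings[0].length);
  return embeddings.map((query) => {
    // Edge weights from this node to every node, itself included.
    const weights = softmax(embeddings.map((key) => dot(query, key) / scale));
    // Aggregate incoming "messages": weighted sum of every node's value.
    return query.map((_, d) =>
      embeddings.reduce((sum, value, j) => sum + weights[j] * value[d], 0)
    );
  });
}

console.log(attentionMessagePassing([[1, 0], [0, 1], [1, 1]]));
```

In GNN terms, the softmax weights play the role of edge weights and the weighted sum is the aggregation step.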

I'm not quite sure what you mean that they cannot do causal graph analysis, since that was one of many different tasks provided in the various different types of reasoning studies in the paper I mentioned. In fact it may have been the best performing task. Perhaps try checking the paper again - it's quite a lot of experiments and text, so it's understandable to not ingest all of it quickly.

In addition, if you're interested in seeing further evidence of algorithmic reasoning capabilities occurring in transformers, Hattie Zhou has a good paper on that as well. https://arxiv.org/pdf/2211.09066.pdf

The story is really not shaping up to be 'stochastic parrots' if any real deep analysis is performed. The only way that I see someone could have such a conclusion is if they are not an expert in the field, and simply glance at the mechanics for a few seconds and try to ham-handedly describe the system (hence the phrase: "it just predicts next token"). Of course, this is a bit harsh, and I don't mean to suggest that these systems are somehow performing similar brain-like reasoning mechanisms (whatever that may mean) etc, but stating that they cannot reason (when there is literature on the subject) because 'it's just statistics' is definitely not accurate.

pas 1119 days ago

> they cannot do causal graph analysis

I mean the ANN in the inference stage when run does not draw up a nice graph, doesn't calculate weights, doesn't write down pretty little Bayesian formulas, it does whatever is encoded in the matrices-innerproduct-context.

And it's accurate in a lot of cases (because there's sufficient abstract similarity in the training data), and that's what I meant by "of course it'll likely be useful in many cases".

At least this is my current "understanding", I haven't had time to dig into the papers unfortunately. Thanks for the further recommendation!

What seems very much missing is characterizing the reasoning that is going on. Its limitations, functional dependencies, etc.

jxf 1120 days ago

If a counterexample to a specific claim doesn't disprove the claim, that sometimes suggests the claim is unfalsifiable and therefore suspect.

fasterik 1120 days ago

The claim is that GPT-4 can reason sometimes. Evidence that GPT-4 fails to reason sometimes isn't a counterexample.

krainboltgreene 1120 days ago

The people who made GPT-4 have said it does not reason, please for the love of god drop this nonsense.

fasterik 1120 days ago

I never said that it does. I was pointing out a logical flaw in that person's argument. Also, why are the creators of GPT-4 authorities on what does or doesn't count as reasoning?

travisjungroth 1120 days ago

It’s suspect until it’s demonstrated. Once someone has demonstrated it, counterexamples are meaningless.

I claim I can juggle. I pick up three tennis balls and juggle them. You hand me three basketballs. I try and fail. My original claim, that I can juggle, still stands.

physPop 1120 days ago

I disagree -- that would disprove your claim, as your claim was too broad. Same if they handed you chainsaws or elephants, or seventy-two tennis balls.

The more correct claim is you can juggle [some small number of items with particular properties].

travisjungroth 1120 days ago

Normal English implies that you can do something, not everything. It’s an any versus all distinction, and all is totally unreasonable except for the most formal circumstances.

“Can you ride a bike?”

“Yeah.”

“Prove it. Here I have the world’s smallest bicycle.” <- this person is not worth your time and attention.
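The any-versus-all distinction can be written down directly. A small sketch using the juggling example from this thread as data; the variable names are made up for illustration:

```javascript
// "I can juggle" read as an existential claim vs. a universal one.
// The attempts mirror the juggling example in the thread.
const attempts = [
  { objects: "three tennis balls", succeeded: true },
  { objects: "three basketballs", succeeded: false },
];

// Existential reading: a single success is enough to establish it.
const canJuggleSomething = attempts.some((a) => a.succeeded);
// Universal reading: a single failure is enough to refute it.
const canJuggleAnything = attempts.every((a) => a.succeeded);

console.log(canJuggleSomething); // true
console.log(canJuggleAnything); // false
```

The basketball failure only changes the result of `every`, not of `some`.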

alienicecream 1120 days ago

It can't even count reliably. And this is a computer, not a human. That is one of the simplest things a computer should be able to do. It can't count because it doesn't know what counting is, not because it's unreliable in the way a human would be when counting. You cannot reason if you do not understand the concepts you are working with. The result is not the measure of success here, because it is good at mimicking, but when it fails at such a basic computing task, you can reasonably conclude it has no idea what it's doing.

pxc 1120 days ago

That's because

> I can juggle

is here shorthand for

> I can juggle at all; I can juggle at least some things

and the basketball case is only a counterexample to the much stronger claim

> I can juggle anything

But the argument about AIs reasoning has little to do with such examples, because juggling is about the ability to complete the task alone. When it comes to reasoning there are questions about authenticity that don't have analogs in determining whether a person can juggle.

travisjungroth 1120 days ago

This thread is exactly such examples.

What would “not alone” mean? Do you think someone is passing it the answers? Of course it was trained, but that’s like cheating on a test by reading the material so you can keep a cheat sheet in your brain.

Buttons840 1120 days ago

So, I just tried this. I pasted 60 letter A's into GTP4 and asked it to count, it got it wrong, but I repeatedly said "count again" and nothing else, so as to not give it any hints. Here's GTP4's guesses along the way as I repeatedly said "count again".

69, 50, 100, 70, 68, 60, 60, 60, 60 (GTP gathered its own guesses into this list for me BTW)

It seems if GTP is given "attention" enough, it can do the counting. But it cannot direct its attention freely, only as we give it instruction to do so.

I just did it again with 66 letter A's. Guesses were: 100, 100, 98, 67, 66, 66, 66, 66 -- GTP4 again settled on the correct answer. I also burned through my prompt quota for the next 3 hours :(

Also, as a GTP style challenge, how many numbers are in this message? You have half-a-second, go!

EForEndeavour 1120 days ago

Aside: I've become so overexposed to the acronym "GPT" from months of completely breathless hype that I'm taken aback whenever I see it consistently misspelled as e.g. GTP. Feels like the equivalent of seeing someone inexplicably write about "chainblock technology."

Buttons840 1119 days ago

I've been corrected.

pixl97 1120 days ago

LLMs don't see words like you do. Tokenization makes them behave oddly. Often you can get GPT-4 to output a solution in code and from that it derives a correct answer.

kordlessagain 1118 days ago

Yeah, I'm sorry I missed this! The example of asking it things like counting or sequences isn't a great one because it's been solved by asking it to "translate" to code and then run the code. I took this up as a challenge a while back with a similar line of reasoning on Reddit (that it couldn't do such a thing) and ended up implementing it in my AI web shell thing.

  heavy-magpie|> I am feeling excited.
  system=> History has been loaded.
  pastel-mature-herring~> !calc how many Ns are in nnnnnnnnnnnnnnnnnnnn
  heavy-magpie|> Writing code.
  // filename: synth_num_ns.js
  // version: 0.1.1
  // description: calculate number of Ns
  var num_ns = 'nnnnnnnnnnnnnnnnnnnn';
  var num_Ns = num_ns.length;
  Sidekick("There are " + num_Ns + " Ns in " + num_ns + ".");
  heavy-magpie|> There are 20 Ns in nnnnnnnnnnnnnnnnnnnn.

As far as the not not thing, ChatGPT-4 seems to handle that pretty well...

fnordpiglet 1120 days ago

I’d note none of these are reasoning tasks.

xwdv 1120 days ago

No, but if you ask it “Are you sure?” after it gives an answer, then it becomes a reasoning task and it often gives a different wrong answer.

flandish 1120 days ago

No, it does not become a reasoning task. The human asking “are you sure?” is actually just inputting words into the model.

The model outputs what it predicts would be a statistically normal output in the given context.

Truly “llm” and these gpt tools are very much large scale “soundex” models.

Fantastic and great.

But not ai or even agi.

fnordpiglet 1119 days ago

AI is a pretty broad term. I think it safely fits there. AGI/ sentience / etc. No. And yes it’s not reasoning because by definition reasoning requires agency, as you point out. However I think you make a few assumptions I wouldn’t be comfortable with.

Is human intelligence anything more than a statistical model? Our entire biology is a massive gradient descent optimization system. Our brains are no different. The establishment of connectivity and potential and resistance, etc etc, it’s statistical in behavior all the way down. Our way of learning is what these models are built around, to the best of our ability. It’s not perfect but it’s a reasonable approximation.

Further it’s not soundex. I see the stochastic parrot argument too much and it’s annoying. Soundex is symbolic only. LLMs are also semantic. In fact the semantic nature is where their interesting properties emerge from. The “just a fancy Markov model” or “just a large scale soundex” misses the entire point of what they do. Yes they involve tokenizing and symbols and even conditional probability. But so does our intelligence. The neural net based attention to semantic structure is however not soundex or a Markov model. It’s a genuine innovation and the properties that emerge are new.

But new doesn’t mean complete. To be complete you need to build an ensemble model integrating all the classical techniques of goal based agency, optimization, solvers, inductive/deductive reasoning systems, IR, etc etc in a feedback loop. The LLM provides an ability to reason abductively in an abstract semantic space and interpret inputs and draw conclusions classical AI is very bad at. The places where LLMs fall down… well, classical AI really shines there. Why does it need to be able to do logic as well as a logical solver? We already have profoundly powerful logic systems. Why does it need to count? We already have things that count. What we did not have is what LLMs provide, and more specifically multimodal LLMs.

pixl97 1120 days ago

I mean, in training children we give them reasoning tasks they commonly get wrong. I don't think we say they are incapable of reasoning because they get wrong answers commonly?

This is why we see improvement in GPT when chain of thought/tree of thought is used with reasoning for each step. That can't correct every failure mode, but it increases the likelihood you'll receive a more correct answer.

fnordpiglet 1120 days ago

Are you sure?

PartiallyTyped 1120 days ago

Approaches that involve a scratchpad or eg algorithmic execution should deal with this just fine.

The algorithmic execution paper argues GPT 4 can do arithmetic with 13 digit numbers before performance drops below 95%.
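As an illustration of what a scratchpad buys you, here is grade-school addition that writes out every intermediate step (digit pair, carry in, digit out) instead of producing the answer in one shot. It is a sketch of the general idea, with a made-up function name and step format, and not the prompt format of any particular paper:

```javascript
// Grade-school addition on decimal strings, recording each intermediate
// step the way a scratchpad would. Inputs are non-negative integers
// given as strings of digits.
function addWithScratchpad(a, b) {
  const width = Math.max(a.length, b.length);
  const x = a.padStart(width, "0");
  const y = b.padStart(width, "0");
  const steps = [];
  let carry = 0;
  let digits = "";
  for (let i = width - 1; i >= 0; i--) {
    const sum = Number(x[i]) + Number(y[i]) + carry;
    steps.push(`${x[i]} + ${y[i]} + carry ${carry} = ${sum}, write ${sum % 10}`);
    digits = String(sum % 10) + digits;
    carry = Math.floor(sum / 10);
  }
  if (carry > 0) digits = String(carry) + digits;
  return { result: digits, steps };
}

// A 13-digit example where the carry ripples through every position.
const demo = addWithScratchpad("9".repeat(13), "1");
console.log(demo.result); // 10000000000000
console.log(demo.steps.join("\n"));
```

Every step depends only on two digits and a carry, so no single step requires handling the whole 13-digit number at once.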

anigbrowl 1120 days ago

I have bad news for you about human people...