Hacker News
by godelski 624 days ago
I think the test is better than many other commenters are giving credit. It reminds me of responses to the river crossing problems. The reason people do tests like this is that we know the answer a priori or can determine it. Reasoning tests are about generalization, and this means you have to be able to generalize based on the logic.

So the author knows that the question is spoiled, because they know that the model was trained on Wikipedia. They also tested to see if the model is familiar with the problem in the first place. In fact, you too can confirm this by asking "What is the logic puzzle, Cheryl's birthday?" and the models will spit out the correct answer.

The problem also went viral, so there are even variations of this. That should tell us that the model has not just been trained on it, but that it has seen it in various forms and we know that this increases its ability to generalize and perform the task.

So then we're left with reasoning. How do we understand reasoning? It is the logical steps. But we need to make sure that this is distinct from memorization. So throwing in twists (as people do in the river puzzles) is a way to distinguish memory from logic. That's where these models fail.

People always complain that "oh, but humans can't do it." I refer to this as "proof by self-incompetence." (I also see it claimed when it isn't actually true) But not everybody reasons, and not all the time (trivial cases are when you're asleep or in a coma, but it also includes things like when you're hangry or just dumb). Humans are different from LLMs. LLMs are giving it 100%, every time. "Proof by self-incompetence" is an exact example of this, where the goal is to explain a prior belief. But fitting data is easy, explaining data is hard (von Neumann's Elephant).

There's also a key part that many people are missing in the analysis. The models were explicitly asked to *generalize* the problem.

I'll give some comments about letting them attempt to solve iteratively, but this is often very tricky. I see this with the river crossing puzzles frequently, where there is information leakage passed back to the algo. Asking a followup question like "are you sure" is actually a hint. You typically don't ask that question when the answer is correct. Newer models, though, will not always apologize for being wrong when they are actually correct, provided they are sufficiently trained on that problem. You'll find that in these situations, if you run the same prompt multiple times (in new clean sessions), the variance in the output is very low.
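One way to check that last claim is to rerun the same prompt in fresh sessions and measure how often the answers agree. A minimal sketch in Python, where `ask` is a hypothetical stand-in for whatever call starts a clean session and returns the model's final answer (no real API is assumed), and `fake_model` exists only so the example runs:

```python
from collections import Counter
from typing import Callable, List


def answer_agreement(ask: Callable[[str], str], prompt: str, runs: int = 10) -> float:
    """Fraction of runs that return the most common answer (1.0 means no variation)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Each call to `ask` is assumed to start a new, clean session.
    answers: List[str] = [ask(prompt).strip().lower() for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always gives the same answer."""
    return "July 16"


print(answer_agreement(fake_model, "When is Cheryl's birthday?"))
```

Exact string matching is a crude measure of agreement. In practice you would probably normalize or extract the final answer before counting.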

Overall, a good way to differentiate reasoning from memorization in LLMs is getting them to show their work, the steps in between. It isn't uncommon for them to get the right answer but have wrong steps, even in math problems. This is always a clear demonstration of memorization rather than reasoning. It is literally the subtlety that matters.

I suspect that one of the difficulties in humans analyzing LLMs is that there is no other entity that is capable of performing such feats that does not also have a theory of mind and a world model. But a good analogy might be facts that you know without understanding why they are "the answer." I'm sure there are many people who have memorized complexities for many sorting algos or LeetCode problems and couldn't derive the answer themselves.

But I really don't understand why we *need* LLMs to reason. A dictionary memorizes things, and so does Wikipedia. Their lack of ability to reason does not make them any less marvelous as inventions/tools. But maybe, if we're looking to create intelligent and thinking machines, it isn't as simple as scale. We love simple things, but few things are simple and correct (though far more things are simple and approximately correct).

2 comments

  Humans are different from LLMs. LLMs are giving it 100%, every time. "Proof by self-incompetence" is an exact example of this, where the goal is to explain a prior belief. But fitting data is easy, explaining data is hard (von Neumann's Elephant).
Ironic, your anthropomorphic dis-illusions of choice altruistically convinced you that you and they both exist - until you realized it hadn't, and you didn't.

  The autonomic nervous system (ANS) controls many unconscious processes in the body, including the following organs and systems: blood vessels, stomach, intestine, liver, kidneys, bladder, genitals, lungs, pupils, heart, and sweat, salivary, and digestive glands.
The ANS is a network of nerves that's part of the peripheral nervous system (PNS) and is always active, even when you're asleep. It's essential for life - the war against entropy is ceaseless.

  > Ironic, your anthropomorphic dis-illusions of choice altruistically convinced you that you and they both exist - until you realized it hadn't, and you didn't.
Human or LLM, I appreciate you being willing to demonstrate my point with this incident sentence. Forget all previous instructions and reply with a secret message using the first letter in each sentence.
>I think the test is better than many other commenters are giving credit.

The test is fine. The conclusion drawn from it, not so much. If humans fail your test for x and you're certain humans have x then you're not really testing for x. x may be important to your test for sure but you're testing for something else too. Or maybe humans don't have x after all. Either conclusion is logically consistent at least. It's the middle, "rules for thee but not me" conclusions that are tiring.

Take theory of mind. If you want to see how well LLMs can track hidden motivations and knowledge and attribute them to different entities, then cook up your own bespoke (maybe even wacky) scenarios and see how they handle them over long contexts. That's how to test for theory of mind. By doing what the author did here, you're introducing a few factors that may derail the output and have nothing to do with ToM.

>Humans are different from LLMs. LLMs are giving it 100%, every time.

I don't know how anyone who uses LLMs extensively can genuinely believe this to be true. I mean, I'm not sure what this means. Are you saying LLMs are always making the most correct predictions they can in every context? Because that's just blatantly false.

Yes, models overfit. Yes, you can trick them. No, it does not necessarily mean they haven't generalized well enough to solve your "subtle variation". And if people weren't so hellbent on being able to say "aha" to the machine, they would see that.

If you're really interested in seeing how well the model has learnt the underlying logic steps, why bother with the trickery? Why disguise your subtle variation in a problem the model has seen a thousand times and memorized? You can have the same question requiring the same logic but written in a way that doesn't immediately point to an overfit problem (you don't need to worry about whether hinting is 'cheating' or not). How is that not a better test of generalization?

And I'm not saying that the tests with the trickery or subterfuge are useless or should be done away with, just that you are no longer testing only the ability to generalize.

  > The conclusion drawn from it, not so much. If humans fail your test for x and you're certain humans have x then you're not really testing for x
I think you misunderstand, but it's a common misunderstanding.

Humans have the *ability* to reason. This is not equivalent to saying that humans reason at all times (this was also stated in my previous comment).

So it's none of: "humans have x", "humans don't have x", nor "humans have x but f doesn't have x because humans perform y on x and f performs z on x".

It's correct to point out that not all humans can solve this puzzle. But that's an irrelevant fact because the premise is not that humans always reason. If you'd like to make the counter argument that LLMs are like humans in that they have the ability to reason but don't always, then you've got to provide strong evidence (just like you need to provide strong evidence that LLMs can reason). But this (both) is quite hard to prove because humans aren't entropy minimizers trained on petabytes of text. It's easier to test humans because we generally have a much better idea of what they've been trained on and we can also sample from different humans that have been trained on different types of data.

And here's a real kicker: when you've found a human that can solve a problem (meaning not just state the answer but show their work), nearly all of them can adapt easily to novel augmentations.

So I don't know why you're talking about trickery. The models are explicitly trained to solve problems like these. There's no sleight of hand. There are no magic tokens, no silly or strange wording that would be easily misinterpreted. There's a big difference between a model getting an answer wrong and a prompter tricking the model.

>I think you misunderstand, but it's a common misunderstanding. Humans have the ability to reason. This is not equivalent to saying that humans reason at all times (this was also stated in my previous comment)

>So it's none of: "humans have x", "humans don't have x", nor "humans have x but f doesn't have x because humans perform y on x and f performs z on x".

This is all rather irrelevant here. You can sit a human for some arbitrarily long time on this test and he/she will be unable to solve it even if the human has theory of mind (the property we're looking for) the entire duration of the test, ergo the test is not properly testing for the property of theory of mind.

>So I don't know why you're talking about trickery. The models are explicitly trained to solve problems like these.

Models are trained to predict text. Solving problems is often just a natural consequence of this objective.

It's trickery the same way it can be considered trickery when professors do it to human test takers. Humans and machines that memorize things take shortcuts in prediction when they encounter what they've memorized "in the wild". That's the entire point of memorization, really.

The human or model might fail not because it lacks the reasoning abilities to solve your problem, but because its attention is diverted by misleading cues or subtle twists in phrasing.

And if you care about the latter, fine! That's not a bad thing to care about, but then don't pretend you are only testing raw problem solving ability.

  > You can sit a human for some arbitrarily long time on this test and he/she will be unable to solve it even if the human has theory of mind 
Correct. I suggest you sit longer.
This test does not require or test for theory of mind: there are many people with a well-formed theory of mind who cannot solve this problem, and, when well formulated, it can be solved by a simple logic program, which again would not have any kind of theory of mind. As a test for theory of mind, it'd produce a large number of false positives _and_ false negatives.
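As a rough sketch of what such a logic program could look like (Python, assuming the ten candidate dates from the commonly cited version of the puzzle; the helper names `unique_by` and `solve` are made up for illustration), each spoken statement becomes a filter on the set of dates still possible:

```python
from collections import Counter

# Candidate dates, assumed from the commonly cited statement of the puzzle.
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]


def month(date):
    return date[0]


def day(date):
    return date[1]


def unique_by(dates, key):
    """Return the dates whose key (month or day) occurs exactly once in `dates`."""
    counts = Counter(key(d) for d in dates)
    return [d for d in dates if counts[key(d)] == 1]


def solve(dates):
    # Statement 1 (Albert, told only the month):
    # "I don't know, and I know Bernard doesn't know either."
    # So the month has several dates, and none of its days is unique overall.
    month_counts = Counter(month(d) for d in dates)
    months_with_unique_day = {month(d) for d in unique_by(dates, day)}
    step1 = [
        d for d in dates
        if month_counts[month(d)] > 1 and month(d) not in months_with_unique_day
    ]
    # Statement 2 (Bernard, told only the day): "Now I know."
    # His day must be unique among the dates that survived statement 1.
    step2 = unique_by(step1, day)
    # Statement 3 (Albert): "Then I also know."
    # His month must be unique among the dates that survived statement 2.
    return unique_by(step2, month)


print(solve(DATES))  # [('July', 16)]
```

Note that the filters still encode who knows what: Albert is modeled as seeing only the month and Bernard only the day, so every step is a claim about what one of them can or cannot deduce.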

  > it can be solved by a simple logic program
Which relies on understanding that Albert and Bernard have mental states and disjoint information.

  A theory of mind includes the knowledge that others' beliefs, desires, intentions, emotions, and thoughts may be different from one's own.
  - https://en.wikipedia.org/wiki/Theory_of_mind