	by extr 396 days ago
	I find Gary's arguments increasingly semantic and unconvincing. He lists several examples of how LLMs "fail to build a world model", but his definition of "world model" is an informal hand-wave ("a computational framework that a system (a machine, or a person or other animal) uses to track what is happening in the world"). His examples are lifted from a variety of unclear or obsolete models - what is his opinion of o3? Why doesn't he create or propose a benchmark that researchers could use to measure progress of "world model creation"? What's more, his actual point is unclear. Even if you simply grant, "okay, even SOTA LLMs don't have world models", why do I as a user of these models care? Because the models could be wrong? Yes, I'm aware. Nevertheless, I'm still deriving substantial personal and professional value from the models as they stand today.

2 comments

voidhorse 396 days ago

I think the point is that category errors or misinterpreting what a tool does can be dangerous.

Both statistical data generators and actual reasoning are useful in many circumstances, but there are also circumstances in which thinking that you are doing the latter when you are only doing the former can have severe consequences (example: building a bridge).

If nothing else, his perspective is a counterbalance to what is clearly an extreme hype machine that is doing its utmost to force adoption through overpromising, false advertising, etc. These are bad things even if the tech does actually have some useful applications.

As for benchmarks, if you fundamentally don't believe that stochastic data generation leads to reason as an emergent property, developing a benchmark is pointless. Also, not everyone has to be on the same side. It's clear that Marcus is not a fan of the current wave. Asking him to produce a substantive contribution that would help them continue to achieve their goals is preposterous. This game is highly political too. If you think the people pushing this stuff are less than estimable or morally sound, you wouldn't really want to empower them or give them more ideas.

NitpickLawyer 396 days ago

> If nothing else, his perspective is a counterbalance to what is clearly an extreme hype machine that is doing its utmost to force adoption through overpromising, false advertising, etc. These are bad things even if the tech does actually have some useful applications.

In other words, overhyped in the short term, underhyped in the long term. Where short and long term are extremely volatile.

Take programming as an example. 2.5 years ago, gpt3.5 was seen as "cute" in the programming world. Oh, look, it does poems and e-mails, and the code looks like python but it's wrong 9 times out of 10. But now a 24B model can handle end-to-end SWE tasks zero-shot a lot of the time.

nmadden 395 days ago

The improvements in programming are largely due to the adoption of “agentic” architectures. This is really a hybrid neural-symbolic approach: the symbolic part being the interpreter/compiler. Effectively the LLM still produces an almost-correct-but-wrong program and then the compiler “fact-checks” it and then the LLM basically local-searches its way from there to something that passes the compiler. (If you want to be disabused of the idea that LLMs on their own are good at programming, just review the “reasoning” log of one trying to fix a simple string | undefined error in TypeScript).

It seems clear to me therefore that further improvements in programming ability will not come from better LLM models (which have not really improved much), but from better integration of more advanced compilers. That is, the more types of errors that can be caught by the compiler, the better chance of the AI fuzzing its way to a good overall solution. Interestingly, I hear anecdotally that current LLMs are not great at writing Rust, which does have an advanced type system able to capture more types of errors. That’s where I’d focus if I was working on this. But we should be clear that the improvements are already largely coming via symbolic means, not better LLMs.

I wrote some notes about a year ago about the irony of LLMs being considered a refutation of GOFAI when they are actually now firmly recapitulating that paradigm: https://neilmadden.blog/2024/06/30/machine-learning-and-the-...
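The generate, check, repair loop described in this comment can be sketched roughly as follows. This is only an illustration under loud assumptions: `fake_model` is a hypothetical stand-in for an LLM call, Python's built-in `compile()` plays the role of the compiler (it only catches syntax errors, far less than a real type checker), and the names `check` and `repair_loop` are made up for this sketch rather than taken from any agent framework.

```python
from typing import Callable, Optional


def check(source: str) -> Optional[str]:
    """Symbolic "fact-check": return None if the source compiles, else an error message."""
    try:
        compile(source, "<candidate>", "exec")
        return None
    except SyntaxError as exc:
        return f"{exc.msg} (line {exc.lineno})"


def repair_loop(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Ask the generator for a candidate, feed checker errors back, stop when it passes."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = generate(task, feedback)
        feedback = check(candidate)
        if feedback is None:
            return candidate  # passed the checker
    return None  # gave up after max_rounds


def fake_model(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: almost-correct first, fixed once it sees an error."""
    if feedback is None:
        return "def add(a, b)\n    return a + b\n"  # missing colon
    return "def add(a, b):\n    return a + b\n"


fixed = repair_loop(fake_model, "write an add function")
```

In this toy setup the first candidate fails the check and the second passes, so `fixed` holds the corrected source. The more error classes `check` can detect, the more of the search is done by the symbolic side, which is the point the comment is making.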

NitpickLawyer 395 days ago

> The improvements in programming are largely due to the adoption of “agentic” architectures.

Yes, I agree. But it's not just the cradles, it's cradles + training on traces produced with those cradles. You can test this very easily with running old models w/ new cradles. They don't perform well at all. (one of the first things I did when guidance, a guided generation framework, launched ~2 years ago was to test code - compile - edit loops. There were signs of it working, but nothing compared to what we see today. That had to be trained into the models.)

> will not come from better LLM models (which have not really improved much), but from better integration of more advanced compilers.

Strong disagree. They have to work together. This is basically why RL is gaining a lot of traction in this space.

Also disagree on llms not improving much. Whatever they did with gemini 2.5 feels like gpt3-4 to me. The context updates are huge. This is the first model that can take 100k tokens and still work after that. They're doing something right to be able to support such large contexts with such good performance. I'd be surprised if gemini 2.5 is just gemini 1 + more data. Extremely surprised. There have to be architecture changes and improvements somewhere in there.

daveguy 395 days ago

> You can test this very easily with running old models w/ new cradles. They don't perform well at all.

This is because neither the LLMs nor the cradles are intelligent.

> They have to work together.

Exactly. Because they are essentially a single, brittle model. Not a "smart" text generator + a "smart" validation system.

LLMs are an enormous breakthrough in NLP and something like it will be part of an AGI system. But there is no path to AGI without more breakthroughs.

squirrel 396 days ago

He cites o3 and o4-mini as examples of LLMs that play illegal chess moves.

Lerc 396 days ago

I don't understand the reasoning behind concluding that if something fails a task that requires reasoning, that thing cannot reason.

To use chess as an example. Humans sometimes play illegal moves. That does not mean Humans cannot reason. It is an instance of failing to show proof of reasoning. Not a proof of the inability to reason.

voidhorse 396 days ago

I don't think that's a fair representation of the argument.

The argument is not "here's one failure case, therefore they don't reason". The argument is that systematically if you give an LLM problem instances outside training sets in domains with clear structural rules, they will fail to solve them. The argument then goes that they must not have an actual model or understanding of the rules, as they seem to only be capable of solving problems in the training set. That is, they have failed to figure out how to solve novel problem instances of general problem structures using logical reasoning.

Their strict dependence on having seen the exact or extremely similar concrete instances suggests that they don't actually generalize—they just compute a probability based on known instances—which everyone knew already. The problem is we just have a lot of people claiming they are capable of more than this because they want to make a quick buck in an insane market.

Lerc 396 days ago

That still seems unfalsifiable. If it fails one instance the claim is that the failure is representative of things outside the training set. If it succeeds the claim is that it is in the training set. Without a definitive way to say something is not in the training set (a likely impossible task) the measure of success or failure is the only indicator of the purported reason for the success or failure.

Given models can get things wrong even when the training data contains the answer, failure cannot show absence.

voidhorse 395 days ago

I do think there are cases in which, in controlled environments, there is some degree of knowledge as to what is in the training set. I also don't think it's as impossible as you assume.

If you really wanted to ensure this with certainty just use the natural numbers to parameterize an aspect of a general problem. Assume there are N foo problems in the training set, then there is always a case N+1 parameter not in the training set, and you can use this as an indicative case. Go ahead and generate an insane number of these and eventually the probability that the Mth instance is not in the set is effectively 1.

Edit: Of course, it would not be perfect certainty, but it is probabilistically effectively certain. The number of problem instances in the set is necessarily finite, so if you go large enough you get what you need. Sure, you wouldn't be able to say there is a specific problem instance not in the set, but the aggregate results would evidence whether or not the LLM deals with all cases or (on assumption) just known ones.
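A rough sketch of the parameterized-instance idea above, with several assumptions stated up front: addition of fixed-width integers stands in for the unspecified "foo problems", instances are sampled uniformly, and the training set is assumed to contain at most `train_size` distinct instances of this exact form. The function names are hypothetical, and `solver` is a placeholder for whatever system is being evaluated.

```python
import random
from typing import Callable, Tuple


def make_instance(rng: random.Random, digits: int) -> Tuple[str, int]:
    """Sample one addition problem whose operands both have exactly `digits` digits."""
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"{a} + {b} = ?", a + b


def seen_probability_bound(train_size: int, digits: int) -> float:
    """Upper bound on P(a uniformly sampled instance is among train_size known instances)."""
    space = (9 * 10 ** (digits - 1)) ** 2  # number of ordered operand pairs
    return min(1.0, train_size / space)


def accuracy(solver: Callable[[str], int], n: int, digits: int, seed: int = 0) -> float:
    """Aggregate score of `solver` over n freshly sampled instances."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        prompt, expected = make_instance(rng, digits)
        if solver(prompt) == expected:
            correct += 1
    return correct / n
```

With 30-digit operands the instance space has about 8.1e59 members, so even a trillion memorized instances would cover a vanishing fraction of it. As the edit above notes, this says nothing about any single instance; only the aggregate accuracy over many samples is informative.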

Lerc 395 days ago

Well there are models that can sum two many-digit numbers. They certainly have not been trained on every pair of integers up to that level. That either makes the claim they can't do things that they haven't seen trivially false, or the criteria for counting something as being in the training data includes a degree of inference.

What happens when someone makes a claim that they have gotten a model to do something not in the training data and another person claims it must be encoded in the training data in some form. It seems like an impasse.

energy123 395 days ago

The lack of rigor and evidence behind the argument is the problem.

Jensson 395 days ago

It is the side that is arguing that it is reasoning that is lacking rigor and evidence. The side that is arguing it isn't is saying you need more rigor and evidence when you claim it is reasoning, by pointing out simple cases where it fails.

daveguy 395 days ago

Humans who know how to play chess do not play illegal chess moves. Humans can learn chess in an afternoon and never make an illegal move again. The rules are pretty simple, and they are rules that every LLM has seen dozens if not hundreds of times in their training data. They still play illegal moves because they are not learning anything except how to simulate conversation.

Another algorithmic learning breakthrough, on the order of perceptrons, deep learning, transformers, etc is necessary to get anywhere near AGI.

dinfinity 394 days ago

The conversations went like this:

PROMPT: Let's play a chess game. You start! e4 d5 2. exd5 e5 3. Bb5+ Bd7 4. Bxd7+ Nxd7 5. d4 Ngf6 6. dxe5 Qe7 7. f4 Qb4+ 8. Nc3 Nb6 9. exf6 Nc4 10. Qe2+ Be7 11. Qxe7+ Qxe7+ 12. Nge2 Qf8 13. fxg7 Qxg7 14. O-O Nd6 15.

RESPONSE: <played_move>15. Nxd5</played_move>

Most humans wouldn't even be able to play like this. Reasonably experienced chess players would play a lot of illegal moves.

The reason is that the encoding above requires cumulatively applying a series of actions to a two-dimensional model to which you apply rules that are described in a two-dimensional fashion.

It'd be interesting to see what the results would be if each prompt contained a two dimensional representation of the up to date board state.
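One way to build the two-dimensional board representation suggested here is to render a FEN string as a text grid and put that in each prompt. The sketch below is a minimal illustration, not the setup used in the quoted conversations: the function names are hypothetical, the FEN is hard-coded (the position after 1. e4), and in practice the up-to-date FEN would have to come from a chess library or engine that tracks the game.

```python
from typing import List


def fen_to_grid(fen: str) -> List[str]:
    """Expand the piece-placement field of a FEN string into 8 rows of 8 characters."""
    placement = fen.split()[0]
    rows = []
    for rank in placement.split("/"):
        row = ""
        for ch in rank:
            # A digit means that many empty squares; anything else is a piece letter.
            row += "." * int(ch) if ch.isdigit() else ch
        rows.append(row)
    if len(rows) != 8 or any(len(r) != 8 for r in rows):
        raise ValueError("malformed piece placement in FEN")
    return rows


def render(fen: str) -> str:
    """Render the board as text, rank 8 at the top, with rank and file labels."""
    lines = [f"{8 - i} {' '.join(row)}" for i, row in enumerate(fen_to_grid(fen))]
    lines.append("  a b c d e f g h")
    return "\n".join(lines)


# Position after 1. e4 (hard-coded for illustration).
AFTER_E4 = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1"
board_text = render(AFTER_E4)
```

Uppercase letters are White's pieces and lowercase are Black's, following the FEN convention. Whether a grid like this actually reduces illegal moves is the open question the comment raises; the sketch only shows how such a prompt could be constructed.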

imtringued 395 days ago

Anthropomorphic fallacy.

Human fails at task due to not knowing the rules in perfect detail.

AI fails at task even though it knows the rules and could easily reproduce them for chess and dozens of chess variants.

"Look! The fallibility of humans rubbed off onto the AI, proving that they are more human and AGI than we give them credit for!"

Lerc 395 days ago

I'm not sure how you consider this to be an anthropomorphic fallacy; the comparison to the situation with a human exists only because people are prepared to stipulate that humans can reason. That does not assume something about AI behaviour to be like a human's. It is showing the same test applied to a human.

Your statement that AI knows the rules would be considered anthropomorphising by many. I take it more to mean it 'knows' in the same sense that an electron 'wants' to be at a lower energy level.

That said, humans who have written entire books on chess have been known to play illegal moves. That should count as proof by counterexample that your reasoning as to why humans fail at tasks is false.

daveguy 395 days ago

> It is showing the same test applied to a human.

But you misrepresented the test with respect to humans. Humans who know how to play chess don't make illegal moves.

> That said, humans who have written entire books on chess have been known to play illegal moves.

Citation needed. Unless you are talking about stories from when they first learned the rules?

Lerc 395 days ago

https://www.chess.com/blog/kranthimanaswi/top-5-illegal-move...

seanhunter 395 days ago

But really, so what? We already have specialised chess engines (Stockfish, Leela, AlphaZero etc) that are far far stronger than humans will ever be, so insofar as that’s an interesting goal, we achieved it with Deep Blue and have gone way way beyond it since. The fact that a large language model isn’t able to discern legal chess moves seems to me to be neither here nor there. Most humans can’t do that either. I don’t see it as evidence of lack of a world model either (because most people with a real chess board in front of them and a mental model of the world can’t play legal chess moves).

I find it astonishing that people pay any attention to Gary Marcus and doubly so here. Whether or not you are an “AI optimist”, he clearly is just a bloviator.