Hacker News
by lazarus01 158 days ago
It isn't thinking; it's RL with reward hacking.

It's like taking a student who wins a gold in IMO math but can't solve easier math problems, because they did not study that type of problem, whereas a human who is good at IMO math generalizes to all math problems.

It's just memorizing a trajectory as part of a specific goal. That's what RL is.

> It's like taking a student who wins a gold in IMO math, but can't solve easier math problems

I've tried to think of specific follow-up questions that will help me understand your point of view, but other than "Cite some examples of easier problems that a successful IMO-level model will fail at," I've got nothing. Overfitting is always a risk, but if you can overfit to problems you haven't seen before, that's the fault of the test administrators for reusing old problem forms or otherwise not including enough variety.

GPT itself suggests[1] that problems involving heavy arithmetic would qualify, and I can see that being the case if the model isn't allowed to use tools. However, arithmetic doesn't require much in the way of reasoning, and in any case the best reasoning models are now quite decent at unaided arithmetic. Same for the tried-and-true 'strawberry' example GPT cites, involving introspection of its own tokens. Reasoning models are much better at that than base models. Unit conversions were another weakness in the past that no longer seems to crop up much.

So what would some present-day examples be, where models that can perform complex CoT tasks fail on simpler ones in ways that reveal that they aren't really "thinking?"

1: https://chatgpt.com/share/695be256-6024-800b-bbde-fd1a44f281...

In response to your direct question -> https://gail.wharton.upenn.edu/research-and-insights/tech-re...

“This indicates that while CoT can improve performance on difficult questions, it can also introduce variability that causes errors on “easy” questions the model would otherwise answer correctly.”

As for the strawberry example: there are 25,000 people employed globally who repair broken responses and create training data, a big whack-a-mole effort to remediate embarrassing errors.

(Shrug) Ancient models are ancient. Please provide specific examples that back up your point, not obsolete .PDFs to comb through.

Your ideas are quite weak, and you ask for overwhelming proof but are not willing to read any research. That’s just intellectually lazy.

Perhaps if you took some time to learn from the experts, those who create these systems and really understand what’s happening, you would realize these limitations in AI are widely known.

Take a look around the 5-minute mark.

https://youtu.be/PqVbypvxDto?si=gZq-2yEuE4sTeQZe

Just understand you are dead wrong in your assumptions.

You appear to be arguing with someone who isn't here (or else you replied to the wrong post). Your personal fallacy of choice appears to be, "LLMs aren't godlike and infallible only a few years after being invented, despite absolutely no one ever claiming they were, so it's all a bunch of empty hype."

No one cares about the state of the art. Only the first couple of time derivatives matter. You're not getting smarter, but the models are.

How are those examples coming along, by the way? The ones that prove that IMO-level models aren't reasoning, but just getting really, really lucky?