
by habinero 207 days ago

Or -- and hear me out -- that result doesn't mean what you think it does.

That's the exact reason I mention the Clever Hans story. You think it's obvious because you can't come up with any other explanation, therefore there can't be another explanation and the horse must be able to do math. And if I can't come up with an explanation, well that just proves it, right? Those are the only two options, obviously.

Except no, all it means is you're the limiting factor. This isn't science 101 but maybe science 201?

My current hypothesis is the IMO thing gets trotted out mostly by people who aren't strong at math. They find the math inexplicable, therefore it's impressive, therefore machine thinky.

When you actually look hard at what's claimed in these papers -- and I've done this for a number of these self-published things -- the evidence frequently does not support the conclusions. Have you actually read the paper, or are you just waving it around?

At any rate, I'm not shocked that an LLM can cobble together what looks like a reasonable proof for some things sometimes, especially for the IMO, which is not novel math and has a range of question difficulties. Proofs are pretty code-like and math itself is just a language for concisely expressing ideas.

Here, let me call a shot -- I bet this paper says LLMs fuck up on proofs like they fuck up on code. It will sometimes generate things that are fine, but it'll frequently generate things that are just irrational garbage.

2 comments

CamperBob2 207 days ago

> Have you actually read the paper, or are you just waving it around?

I've spent a lot of time feeding similar problems to various models to understand what they can and cannot do well at various stages of development. Reading papers is great, but by the time a paper comes out in this field, it's often obsolete. Witness how much mileage the ludds still get out of the METR study, which was conducted with a now-ancient Claude 3.x model that wasn't at the top of the field when it was new.

> Here, let me call a shot -- I bet this paper says LLMs fuck up on proofs like they fuck up on code. It will sometimes generate things that are fine, but it'll frequently generate things that are just irrational garbage.

And the goalposts have now been moved to a dark corner of the parking garage down the street from the stadium. "This brand-new technology doesn't deliver infallible, godlike results out of the box, so it must just be fooling people." Or in equestrian parlance, "This talking horse told me to short NVDA. What a scam."

threethirtytwo 207 days ago

On the IMO paper: pointing out that it’s not a gold medal or that some proofs are flawed is irrelevant to the claim being discussed, and you know it. The claim is not “LLMs are perfect mathematicians.” The claim is that they can produce nontrivial formal reasoning that passes external verification at a rate far above chance and far above parroting. Even a single verified solution falsifies the “just regurgitation” hypothesis, because no retrieval-only or surface-pattern system can reliably construct valid proofs under novel compositions.

Your fallback move here is rhetorical, not scientific: “maybe it doesn’t mean what you think it means.” Fine. Then name the mechanism. What specific process produces internally consistent multi-step proofs, respects formal constraints, generalizes across problem types, and fails in ways analogous to human reasoning errors, without representing the underlying structure? “People are impressed because they’re bad at math” is not a mechanism, it’s a tell.

Also, the “math is just a language” line cuts the wrong way. Yes, math is symbolic and code-like. That’s precisely why it’s such a strong test. Code-like domains have exact semantics. They are adversarial to bullshit. That’s why hallucinations show up so clearly there. The fact that LLMs sometimes succeed and sometimes fail is evidence of partial competence, not illusion. A parrot does not occasionally write correct code or proofs under distribution shift. It never does.

You keep asserting that others are being fooled, but you haven’t produced what science actually requires: an alternative explanation that accounts for the full observed behavior and survives tighter controls. Clever Hans had one. Stage magic has one. LLMs, so far, do not.

Skepticism is healthy. But repeating “you’re the limiting factor” while refusing to specify a falsifiable counter-hypothesis is not adversarial engineering. It’s just armchair disbelief dressed up as rigor. And engineers, as you surely know, eventually have to ship something more concrete than that.