| HN Mirror



	by kenjackson 85 days ago
	Whenever I see these papers and try the examples myself, they always work. This paper is two months old, which in LLM years is like 10 years of progress. It would be interesting to actively track how far along each successive model gets...

6 comments

revachol 85 days ago

I just tried it in ChatGPT "Auto" and it didn't work

> Yes — ((((()))))) is balanced.

> It has 6 opening ( and 6 closing ), and they’re properly nested.

Though it did work when using "Extensive Thinking". The model wrote a Python program to solve this.

> Almost balanced — ((((()))))) has 5 opening parentheses and 6 closing parentheses, so it has one extra ).

> A balanced version would be: ((((()))))

It would be interesting to test a couple of different models without a harness, so that no tool calls are possible.
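For reference, the kind of check the model presumably wrote is only a few lines. This is a minimal sketch, not the program ChatGPT actually produced; the function name `paren_report` is made up for illustration.

```python
def paren_report(s: str) -> dict:
    """Count parentheses and check nesting with a running depth counter."""
    depth = 0
    ok = True
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                ok = False  # a ')' appeared with no matching '(' before it
    return {
        "open": s.count("("),
        "close": s.count(")"),
        "balanced": ok and depth == 0,
    }

# The string from the thread: 5 opening and 6 closing parentheses.
print(paren_report("(" * 5 + ")" * 6))  # {'open': 5, 'close': 6, 'balanced': False}
print(paren_report("(" * 5 + ")" * 5))  # {'open': 5, 'close': 5, 'balanced': True}
```

Tracking the depth, and not only the totals, matters because a string like `)(` has equal counts but is still not balanced.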


kenjackson 85 days ago

Weird. I tried it in ChatGPT "Auto" and it worked perfectly. I tried about 10 variations. I also did the letters-in-words tests. It got all of them right.

The one thing I did trip it up on was "Is there the sh sound in the word transportation". It said no, and then realized I had asked about the sound, not the letters. After that it got the rest of the "sounds-like" tests I did.

Clearly, my ChatGPT is just better than yours.


revachol 85 days ago

heh, interesting that. I just tried it twice more with ChatGPT "Instant" (disabling "Auto-switch to Thinking") and it got it wrong both times. Does yours get it right without thinking or tool calls? If so, maybe it does like you better than me.


kenjackson 85 days ago

OK, I didn't think to disable the switch to thinking (I didn't know this was a mode). When I did that, it got it wrong. Oddly, it took about the same amount of time, so thinking mode wasn't taking longer, but it was more accurate.


revachol 85 days ago

Right, though I didn't explicitly disable thinking for my first attempt either. I'd guess my prompt was less detailed than yours and so ChatGPT (in "Auto" mode) decided to allocate thinking tokens for your questions but not mine.


coldtea 85 days ago

Even more interesting to track how many of those are just ad-hoc patched.


raincole 85 days ago

Probably zero. At the end of the day people pay for LLMs that write better code or summarize PDFs of hundreds of pages faster, not the ones that can count the letter r's better.

When LLMs can't count r's: see? LLMs can't think. Hoax!

When LLMs count r's: see? They patched and benchmark-maxxed. Hoax!

You just can't reason with the anti-LLM group.


toraway 85 days ago

Whenever an "LLM fail" goes viral, like the car wash question, you can observe the exact same wording of the question get "fixed" within a week or so, while slight variations in phrasing still reproduce the problem.

Followed by lots of "works perfectly for me, why are people even talking about this?"

I can't say what exactly they're doing behind the scenes but it's a consistent pattern among the big SOTA model providers. With obvious incentive to "fix" the problem so users will then organically "debunk" the meme as they try it themselves and share their experiences.


simianwords 85 days ago

You are misremembering. There’s no patch. All these examples used the instant model.


coldtea 85 days ago

The same non-argument could be made about all kinds of benchmark cheating by tech companies, and yet we have tons of documented examples of them being caught with their pants down.

>You just can't reason with the anti-LLM group.

On the contrary, the reasoning is simple and consistent:

That LLMs can't count r's shows that LLMs don't actually think the way we understand thought (since nobody with the kind of high skills they have in other areas would fail that). And because of that, there are (likely) patches for commonly reported cases, since it's a race to IPO and benchmark-maxxing is very much conceivable.


moffkalast 85 days ago

Yeah, well, I presume at this point they have an agent download new LLM-related papers as they come out and add all the edge cases to their training set ASAP.

Is tokenization extremely efficient? Yes. Does it fundamentally break character-level understanding? Also yes. The only fix is endless memorization.
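A toy illustration of the character-level point, as a rough sketch only: the vocabulary below is hypothetical, and greedy longest-match is a simplification of how real BPE tokenizers work. It just shows that the model-side input is a short list of token IDs, while the letter count lives in the characters.

```python
# Hypothetical vocabulary; real BPE vocabularies are learned and far larger.
TOY_VOCAB = {"straw": 0, "berry": 1, "s": 2, "t": 3, "r": 4,
             "a": 5, "w": 6, "b": 7, "e": 8, "y": 9}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match tokenization over a fixed toy vocabulary."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

word = "strawberry"
print(toy_tokenize(word))  # [0, 1]  (what the model is given)
print(word.count("r"))     # 3       (what the question is about)
```

Nothing in the two IDs says how many r's each token contains; in this toy setup that information would have to be learned per token.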


azakai 85 days ago

You are trying it on a production model. The paper is using models with tool calls disabled.


simianwords 85 days ago

It worked for you because the paper does the experiment without allowing the model to use any reasoning tokens - something that is grossly misleading.


wg0 85 days ago

Actually, almost all LLMs get the counting wrong when they write numbered sections in Markdown. They skip numbers in between and such.
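This is easy to check mechanically. A quick sketch, assuming the sections are written as headings like `## 3. Title` (the function name `numbering_gaps` and the heading format are assumptions for illustration):

```python
import re

def numbering_gaps(markdown: str) -> list:
    """Return (previous, next) pairs where numbered headings skip or repeat."""
    nums = [
        int(m.group(1))
        for m in re.finditer(r"^#+\s+(\d+)[.)]\s", markdown, flags=re.MULTILINE)
    ]
    return [(a, b) for a, b in zip(nums, nums[1:]) if b != a + 1]

doc = "## 1. Intro\ntext\n## 2. Setup\ntext\n## 4. Usage\n"
print(numbering_gaps(doc))  # [(2, 4)]
```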

So yes.

And the valuations. Trillion dollar grifter industry.
