Weird. I tried in chatGPT auto and it worked perfectly. I tried like 10 variations. I also did the letters in words. Got all of them right.
The one thing I did trip it up on was "Is there the sh sound in the word transportation". It said no. And then realized I asked for "sound" not letters. It then subsequently got the rest of the "sounds-like" tests I did.
heh, interesting that. I just tried it twice more with ChatGPT "Instant" (disabling "Auto-switch to Thinking") and it got it wrong both times. Does yours get it right without thinking or tool calls? If so, maybe it does like you better than me.
OK, I didn't think to disable switch to thinking (didn't know this was a mode). When I did that then it did get it wrong -- oddly it took about the same amount of time, so thinking mode wasn't taking longer, but it was more accurate.
Right, though I didn't explicitly disable thinking for my first attempt either. I'd guess my prompt was less detailed than yours and so ChatGPT (in "Auto" mode) decided to allocate thinking tokens for your questions but not mine.
Probably zero. At the end of the day people pay for LLMs that write better code or summarize PDFs of hundreds of pages faster, not the ones that can count the letter r's better.
When LLMs can't count r's: see? LLMs can't think. Hoax!
When LLMs count r's: see? They patched and benchmark-maxxed. Hoax!
Whenever an "LLM fail" goes viral like the car wash question, you can observe the exact same wording of the question get "fixed" within a week or so. With slight variations in phrasing still able to replicate the problem.
Followed by lots of "works perfectly for me, why are people even talking about this?"
I can't say what exactly they're doing behind the scenes but it's a consistent pattern among the big SOTA model providers. With obvious incentive to "fix" the problem so users will then organically "debunk" the meme as they try it themselves and share their experiences.
The same non-argument could be said for all kinds of cheating on benchmarks by tech companies and yet we have tons of documented example of them caught with pants down.
>You just can't reason with the anti-LLM group.
On the contrary, the reasoning is simple and consistent:
LLMs can't count r's shows that LLM don't actually think the way we understand thought (since nobody with the kind of high skills they have in other areas would fail that). And because of that, there are (likely) patches for commonly reported cases, since it's a race to IPO and benchmark-maxxing is very much conceivable.
Yeah well I presume at this point they have an agent download new LLM related papers as they come out and add all edge cases to their training set asap.
Is tokenization extremely efficient? Yes. Does it fundamentally break character-level understanding? Also yes. The only fix is endless memorization.
> Yes — ((((()))))) is balanced.
> It has 6 opening ( and 6 closing ), and they’re properly nested.
Though it did work when using "Extensive Thinking". The model wrote a Python program to solve this.
> Almost balanced — ((((()))))) has 5 opening parentheses and 6 closing parentheses, so it has one extra ).
> A balanced version would be: ((((()))))
Testing a couple of different models without a harness such that no tool calls are possible would be interesting