Hacker News
by hmottestad 454 days ago
This looks like it was posted on Reddit 10 years ago:

https://www.reddit.com/r/math/comments/32m611/logic_question...

So it’s likely that it’s part of the training data by now.

5 comments

You'd think so, but both Google's AI Overview and Bing's Copilot output wrong answers.

Google spits out: "The product of the three numbers is 10,225 (65 * 20 * 8). The three numbers are 65, 20, and 8."

Whoa. Math is not AI's strong suit...

Bing spits out: "The solution to the three people in a circle puzzle is that all three people are wearing red hats."

Hats???

Same text was used for both prompts (all the text after 'For those curious the riddle is:' in the GP comment), so Bing just goes off the rails.

That's a non sequitur; they would be stupid to run an expensive _L_LM for every search query. This post is not about Google Search being replaced by Gemini 2.5 and/or a chatbot.
Yes, putting an expensive LLM response atop each search query would be quite stupid.

You know what would be even stupider? Putting a cheap, wrong LLM response atop each search query.

Google placed its "AI overview" answer at the top of the page.

The second result is this reddit.com answer, https://www.reddit.com/r/math/comments/32m611/logic_question..., where at least the numbers make sense. I haven't examined the logic portion of the answer.

Bing doesn't list any reddit posts (that Google-exclusive deal), so I'll assume no stackexchange-related sites have an appropriate answer (or Bing is only looking for hat-related answers for some reason).

I might have phrased that poorly. With _L_ (or L as intended), I meant their state-of-the-art model, which I presume Gemini 2.5 is (haven't gotten around to TFA yet). Not sure if this question is just about model size.

I'm eagerly awaiting an article about RAG caching strategies though!

The riddle has different variants with hats: https://erdos.sdslabs.co/problems/5
There are 3 toddlers on the floor. You ask them a hard mathematical question. One of the toddlers plays around with pieces of paper on the ground and happens to raise one that has the right answer written on it.

- This kid is a genius! - you yell

- But wait, the kid has just picked an answer from the ground, it didn't actually come up...

- But the other toddlers could do it also but didn't!

Other models aren't able to solve it, so there's something else happening besides it being in the training data. You can also vary the problem and give it a number like 85 instead of 65, and Gemini is still able to properly reason through the problem.
I'm sure you're right that it's more than just it being in the training data, but that it's in the training data means that you can't draw any conclusions about general mathematical ability using just this as a benchmark, even if you substitute numbers.

There are lots of possible mechanisms by which this particular problem would become more prominent in the weights in a given round of training even if the model itself hasn't actually gotten any better at general reasoning. Here are a few:

* Random chance (these are still statistical machines after all)

* The problem resurfaced recently and shows up more often than it used to.

* The particular set of RLHF data chosen for this model draws out the weights associated with this problem in a way that wasn't true previously.

Google Gemini 2.5 is able to search the web, so if you're able to find the answer on reddit, maybe it can too.
I think there’s a big push to train LLMs on maths problems - I used to get spammed on Reddit with ads for data tagging and annotation jobs.

Recently these have stopped, and now the ads are about becoming a maths tutor to AI.

Doesn’t seem like a role with long-term prospects.

Sure, but you can't cite this puzzle as proof that this model is "better than 95+% of the population at mathematical reasoning" when the method of solving it (the "answer") is online, and the model has surely seen it.
It gets it wrong when you give it 728. It claims (728, 182, 546). I won't share the answer so it won't appear in the next training set.
With 728 the puzzle doesn't work, since it's divisible by 8.
But then the AI should tell you that, too, if it really understands the problem?
Fair, the question is what possible solutions exist.
This whole answer hinges on knowing that 0 is not a positive integer, that's why I couldn't figure it out...
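For anyone who wants to check the 728 case, here is a minimal brute-force sketch in Python. The puzzle text isn't quoted in this thread, so the code assumes the common version: three people each see the other two positive integers, one number is the sum of the other two, all three say "I don't know my number" in turn, and then the first person announces their number. The names `candidates`, `knows`, `alive`, and `solve`, and the search bound `limit`, are made up for illustration.

```python
from functools import lru_cache


def candidates(triple, i):
    """Triples person i considers possible. They see the other two numbers,
    so their own is either the sum or the (positive) difference."""
    x, y = (triple[j] for j in range(3) if j != i)
    options = {x + y, abs(x - y)} - {0}  # 0 is not a positive integer
    return [triple[:i] + (v,) + triple[i + 1:] for v in options]


@lru_cache(maxsize=None)
def knows(triple, turn):
    """True if the speaker at `turn` (person turn % 3) is left with exactly
    one candidate that is consistent with the earlier statements."""
    live = [t for t in candidates(triple, turn % 3) if alive(t, turn)]
    return len(live) == 1


@lru_cache(maxsize=None)
def alive(triple, turn):
    """True if `triple` is consistent with every earlier "I don't know"."""
    return all(not knows(triple, s) for s in range(turn))


def solve(n, limit=None):
    """All triples (n, b, c) where the first person can announce n on their
    second turn (turn index 3), after three "I don't know" statements.

    Assumption: only b up to `limit` is searched (default 2 * n); this bound
    is picked for illustration, not derived from the puzzle."""
    limit = 2 * n if limit is None else limit
    found = set()
    for b in range(1, limit + 1):
        # The three ways one number can be the sum of the other two.
        for c in (n - b, b - n, n + b):
            t = (n, b, c)
            if c > 0 and alive(t, 3) and knows(t, 3):
                found.add(t)
    return sorted(found)


print(solve(65))   # [(65, 26, 39)]
print(solve(85))   # [(85, 34, 51)]
print(solve(728))  # [(728, 182, 546), (728, 273, 455), (728, 546, 182)]
```

Under those assumptions, 65 and 85 each give a single triple, while 728 gives three (one of them the (728, 182, 546) mentioned above), which would fit the point that the puzzle no longer has a unique answer there. Dropping the `- {0}` line shows how much the result depends on 0 not counting as a positive integer.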
Thanks. I wanted to do exactly that: find the answer online. It is amazing that people (even on HN) think that an LLM can reason. It just regurgitates the input.
Have you given a reasoning model a novel problem and watched its chain of thought process?
I think it can reason. At least if it can work in a loop ("thinking"). It's just that this reasoning is far inferior to human reasoning, despite what some people hastily claim.
I would say that 99.99% of humans do the same. Most people never come up with anything novel.
I would say maybe about 80%, certainly not 99.99%. But I've seen that in college: some would only be able to solve problems that were pretty much the same as others they had already seen. Notably, some guys could easily come up with solutions to complex problems they had not seen before. My opinion is that no human at age 20 can have had the amount of input an LLM has today. And still, humans of age 20 do come up with very new ideas pretty often (new in the sense that (s)he has not seen that or anything like it before). Of course there are more and less creative/intelligent people...
Reasoning != coming up with something novel.
And if it wasn’t, it is now