Hacker News new | ask | show | jobs
by overgard 338 days ago
We need to stop calling what we have AI. LLMs can't reliably reason. Until they can the progress is far from unstoppable.
1 comments

I love it how people are transitioning from “LLMs can’t reason” to “LLMs can’t reliably reason”.
Well, I was hedging a bit because I try to not overstate the case, but I'm just as happy to say: LLM's can't reason. Because it's not what they're built to do. They predict what text is likely to appear next.

But even if they can appear to reason, if it's not reliable, it doesn't matter. You wouldn't trust a tax advisor that makes things up 1/10 times, or even 1/100 times. If you're going to replace humans, "reliable" and "reproducible" are the most important things.

Frontier models like o3 reason better than most humans. Definitely better than me. It would wipe the floor with me in a debate - on any topic, every single time.
Frontier models went from not being able to count the number of 'r's in "strawberry" to getting gold at IMO in under 2 years [0], and people keep repeating the same clichés such as "LLMs can't reason" or "they're just next token predictors".

At this point, I think it can only be explained by ignorance, bad faith, or fear of becoming irrelevant.

[0] https://x.com/alexwei_/status/1946477742855532918

> At this point, I think it can only be explained by ignorance, bad faith, or fear of becoming irrelevant.

Based on the past history with frontier-math & AIME 2025 [1],[2] I would not trust announcements which cant be independently verified. I am excited to try it out though.

Also, the performance of LLMs was not even bronze [3].

Finally, this article shows that LLMs were just mostly bluffing [4].

[1] https://www.reddit.com/r/slatestarcodex/comments/1i53ih7/fro...

[2] https://x.com/DimitrisPapail/status/1888325914603516214

[3] https://matharena.ai/imo/

[4] https://arxiv.org/pdf/2503.21934

Open AI is 10 years old and and llm just told me a dolar is 1.03 euros.
I don’t know which llm you used - I just asked gpt-4.1 - it did a web search and provided the correct exchange rate. It took about 5 seconds.
That is what we did after chat gpt failed, a web search.

It took us about 5 seconds.