| HN Mirror


by sdesol 546 days ago

I think your point stands, but your example shows that anyone using those calculators daily should not be concerned. Those who need precision to 6+ decimal places for complex equations should know not to fully trust consumer-grade calculators.

The issue with LLMs is that they can be so unpredictable in their behaviour. Take the following prompt that asks GPT-4o mini to validate the response to "calculate 2+3+5 and only display the result":

https://beta.gitsense.com/?chat=6d8af370-1ae6-4a36-961d-2902...

GPT-4o mini contradicts itself, which is not what one would expect on a task we consider extremely simple. However, if you ask it to validate the response to "calculate 2+3+5," it will get it right.

https://beta.gitsense.com/?chat=43221de5-bff6-487a-8c0f-48ca...

By adding "and only display the result," GPT-4o mini was thrown for a loop; examples like this should give us pause.


kube-system 546 days ago

Well, not every tool is a hammer and not every problem is a nail.

If I ask my TI-89 to "Summarize the plot in Harry Potter and the Chamber of Secrets" it responds "ERR"! :D

LLMs are good text processors, pocket calculators are good number processors. Both have limitations, and neither is good at problem sets that are outside of its design strengths. The biggest problem with LLMs isn't that they are bad at a lot of things, it's that they look like they are good at things they aren't good at.


sdesol 546 days ago

I agree LLMs are good at text processing and I believe they will obsolete jobs that really should be obsoleted. Unless OpenAI, Anthropic and other AI companies come up with a breakthrough on reliability, I think it will be fair to say they will only be players and not leaders. If they can't figure something out, it will be Microsoft, Amazon and Google (distributors of diverse models) that will benefit the most.

I've personally found it is extremely unlikely for multiple good LLMs to fail at the same time, so if you want to process text and be confident in the results, I would just run the same task across 5 good models, and if a supermajority of them agree, you can be confident that it was done right.
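A rough sketch of that voting idea, in Python. The function name `supermajority_answer`, the 4-of-5 cutoff, and the sample responses are all assumptions made up for illustration; in practice the strings would come from whichever models you call, and the answers would need to be short enough to compare directly.

```python
from collections import Counter
from typing import Optional, Sequence


def supermajority_answer(answers: Sequence[str], min_votes: int = 4) -> Optional[str]:
    """Return the answer given by at least `min_votes` models, else None.

    `min_votes=4` is an assumed cutoff for a 5-model run, not a fixed rule.
    """
    if not answers:
        return None
    # Light normalization so whitespace or case differences do not split the vote.
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner if votes >= min_votes else None


# Hypothetical outputs from five models for
# "calculate 2+3+5 and only display the result".
responses = ["10", "10", " 10 ", "10", "The answer is 12"]
print(supermajority_answer(responses))  # prints 10 (4 of 5 agree)
```

When no answer reaches the cutoff the function returns `None`, which is the signal to not trust the result and fall back to a human or a different check.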
