LLMs aren't wrong by a small percentage, they are wrong by a small number of tokens. They can miss a zero or be off by 100% and it's just a one-token difference. To the LLM that is a minor mistake, since everything else was right, but in practice it is a massive mistake.
I watch math classes on YouTube and some lecturers make symbolic mistakes all the time. A minus instead of a plus, missing exponents, saying x but writing y, etc. They only notice when something contradicts it down the line.
They got it right as you said, it just took a bit longer. That doesn't contradict what I said: humans can get things right very reliably by looking over the answers, especially if you have another human to help look at the answers. An AI isn't comparable to a human, it is comparable to a team of humans. Two ChatGPTs can't get more accurate by correcting each other's answers, but two humans can.
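To make the "small number of tokens, large error" point concrete, here is a rough sketch in Python. It assumes characters as a stand-in for tokens (real tokenizers split numbers differently), and the helper names `char_edits` and `relative_error` are made up for illustration.

```python
def char_edits(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits turning a into b.

    Assumption: characters stand in for tokens here; a real tokenizer
    may split the same string differently.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def relative_error(expected: str, got: str) -> float:
    """Numeric error of `got` relative to `expected`."""
    e, g = float(expected), float(got)
    return abs(g - e) / abs(e)


# Hypothetical (expected, produced) answers, each one character apart.
cases = [("1000", "100"), ("100", "200"), ("+5", "-5")]

for expected, got in cases:
    print(expected, got, char_edits(expected, got), relative_error(expected, got))
```

Each pair differs by a single character, yet the numeric error is 90%, 100%, and 200% respectively.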
A professor might be able to iterate to a correct answer but a student might not.
And ChatGPT is definitely able to improve its answer by iterating, it just depends on the toughness of the problem. If it's too difficult, no amount of iteration will get it much closer to the correct answer. If it's closer to its reasoning limits, then iterating will help.
But if you stop them just there, an error persists. A professor is “multi-modal” and in a constant stream of events, including their lecture plan and premeditated key results. Are you sure that at some level of LLM “intelligence”, putting it into the same shoes wouldn’t improve the whole setting enough? I mean sure, they make mistakes. But if you stop-frame a professor, they make mistakes too. They don’t correct immediately, only after a contradiction gets presented. Reminds me of how LLMs behave. Am I wrong here?
Edit: was answering to gp, no idea how my post got here
Asking the LLM to correct itself doesn't improve answers, since it will happily add errors to correct answers when asked to correct them. That makes it different from humans: humans can iterate and get better, our current LLMs can't.
> But if you stop them just there, an error persists
But humans don't stop there when they are making things that need to be reliably correct. When errors aren't a big deal humans make a lot of errors, but when errors cost lives humans become very reliable by taking more time and looking things over. They still sometimes make mistakes that kill people, but very rarely.
So many things contribute to human error that it is probably impossible to draw a one-to-one parallel with LLMs. For instance, being recorded causes, in many cases, a significant drop in performance.
What uncertainty and threshold is there in the addition of integers, for example (within mathematics and the usual definitions)? Or in Boolean logic with the "and" operation?
I don't think everything has uncertainty and thresholds to it, especially when it actually resides outside of a technical implementation.
To verify the answer you'll always need to trust the technical implementation that's doing the computation. Doesn't matter if it's our brains or a calculator.
Somewhere between "it's always wrong" and "it's always right unless the bits got flipped by cosmic rays" we deem the accuracy to be good enough.
Disagree, the theory exists outside of any specific technical implementation (every single one of those could be wrong, for example). You might not be able to verify something without being subject to random errors, but that doesn't mean the theory itself is subject to random errors.
Any implementation (or written-down version, etc.) of something can have errors, but the errors are in the implementation and do not give rise to uncertainty outside of the implementation. There is no uncertainty as to what the sum of two integers should be (within the usual mathematics).