by drinfinity 1191 days ago
Confining an LLM to the very narrow domain of "calculators" is a mistake, I think.

You wouldn't say "a programmer that is 99% correct is worthless, I need 100%". I'm pushing it, but for a fairer comparison I'd measure it against a programmer. How often are we wrong? 75% of the time? :) And that's being generous. It's the tools that make us productive.

I don't know about you specifically, but I don't think you'd be very productive with a bare terminal lacking any modern IDE-like or even REPL facilities. Imagine I asked you to come up with instantly working code every time, all the time. It doesn't work like that. You need iteration, and I believe these kinds of AI have the same issues as us. They are wrong sometimes (often) and need feedback.

3 comments

> You need iteration and I believe these kinds of AI have the same issues as us.

It's funny how we resort to humanizing the machines when their results are inaccurate. We don't do that with the calculator, because it's expected to be 100% bug free. When there's a bug in the calculator code we expect it to be fixed, not gradually improved.

Speaking of bugs: mistakes in code are one thing, wrong output because of a fundamental flaw in the algorithm is another. The statistical machines we are dealing with work as intended, or at least the wrong output the top comment here brings up is not a bug, it's a feature. That's the difference.

Literally LLMs get much better with chain of thought, feedback, and/or consensus.

GPT-3 performance on MultiArith goes from 18% to 92% with all three. This isn't some hackneyed anthropomorphizing. Countless research papers show massive improvement with these processes.
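To make "consensus" concrete, here is a rough sketch of the simplest version I can think of: sample several answers and keep the most common one. This is only an illustration, not how any particular paper or API does it. `sample`, `consensus_answer`, and `fake_sample` are hypothetical names, and the canned answers are made up so the snippet runs without a model.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def consensus_answer(sample: Callable[[str], str], question: str, n: int = 5) -> str:
    # `sample` is a hypothetical stand-in for one independent model call
    # that returns a final answer as a string.
    return majority_vote([sample(question) for _ in range(n)])


# Toy stand-in for a model (an assumption for illustration only):
# it returns canned answers and is wrong on 2 of its 5 draws.
_canned = iter(["12", "15", "12", "12", "9"])


def fake_sample(question: str) -> str:
    return next(_canned)


result = consensus_answer(fake_sample, "What is 3 * 4?", n=5)
print(result)  # prints 12
```

Any single draw here can be wrong, but the vote still lands on the answer that shows up most often.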

That's (IMO) too narrow a view of what a "machine" is. Complex machinery of any kind is never 100% correct and needs constant correction and maintenance. I still think approaching this as a "calculator" is awkward at best.

> Complex machinery of any kind is never 100% correct and needs constant correction and maintenance

Computers are extremely close to 100%: we generally expect a CPU to never make errors, even after years of working. If it starts making any errors at all, we throw it away and make a new one.

This is a very weird statement that fails on a category error.

My computer will pretty much add 1+1 correctly forever, never making a mistake.

My computer will perform an 'error' every time I put bad code into it, and some of those logic chains and error conditions are not very obvious.

The issue here is you think the LLM is performing a category 1 error, when the problem we are seeing is a much more human-like category 2 error.

> Computers are extremely close to 100%

We must work in extremely different industries!

Do you code in checks to verify the calculations made by the CPU? I've never ever seen anyone do that. If a CPU starts making errors we throw it away. A typical CPU will make many quadrillions of correct calculations before its first error; I'd say that is basically 0 errors.

He is comparing it to a calculator, and ChatGPT doesn’t measure up in some aspects.

The good thing about reliable tools is you can offload the cognitive burden onto them and know they won’t screw you over.

Almost every single post here about using ChatGPT mentions checking through its output. People don’t check through the output of their calculators.

Is the typing the hard part? I look up libraries and APIs pretty regularly, and maybe an algorithm every few years.

Figuring out what’s wanted from me takes forever though.