Hacker News
by brentm 387 days ago
This is a good articulation of what is a real concern around the AI bull thesis.

If a calculator works great 99% of the time you could not use that calculator to build a bridge.

Using AI for more than code generation is still very difficult and requires a human in the loop to verify the results. Sometimes using AI ends up being less productive because you're spending all your time debugging its outputs. It's great, but there are also a lot of questions about whether this technology will ultimately lead to the productivity gains that many think are guaranteed in the next few years. There is a non-zero chance it ends up actually hurting productivity because of all the time wasted trying to get it to produce magic results.

7 comments

> If a calculator works great 99% of the time you could not use that calculator to build a bridge.

We know for certain that certified lawyers have committed malpractice by using ChatGPT, in part because the made-up citations are relatively easy to spot. Malpractice by engineers might take a little more time to discover.

Engineers' work is also externally verifiable, e.g. by unit tests for software, but I'm assuming by other sorts of automated protocols for civil engineering. I would hope a bridge is not built without triple checking the various outcomes.
Well, most of the LLM-generated code I serve is unit tests (and scripts), so hopefully those are good enough to catch my mistakes :)
If that argument were to save anyone, it would have saved the lawyers too.
A pedantic but maybe-not-entirely-pedantic point: It depends on what you mean by 99%.

If the calculator has a little gremlin in it that rolls a random 100-sided die, and gives you the wrong answer every time it rolls a 1, then you certainly can use it to build a bridge. You just need to do each calculation say 10 or 20 times and take the majority answer :)
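To put rough numbers on the random-gremlin case, here is a minimal sketch. It assumes each calculation fails independently with probability 1% and, pessimistically, that all wrong answers agree with each other and that ties count as failures. The function name is made up for illustration.

```python
from math import comb


def majority_wrong_probability(n_trials: int, error_rate: float) -> float:
    """Probability that at least half of n independent trials are wrong.

    Assumption: every wrong answer is the same wrong answer, so the vote
    fails as soon as wrong answers tie or outnumber correct ones.
    """
    threshold = (n_trials + 1) // 2  # ceil(n / 2): ties count as failure
    return sum(
        comb(n_trials, k) * error_rate**k * (1 - error_rate) ** (n_trials - k)
        for k in range(threshold, n_trials + 1)
    )


if __name__ == "__main__":
    for n in (1, 5, 11, 21):
        print(n, majority_wrong_probability(n, 0.01))
```

Under these assumptions, 11 repeats already push the failure probability below one in a billion. Everything hinges on the errors being independent, which is exactly what the next paragraphs call into question.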

If the gremlin is clever, it might remember the wrong answers it gave you, and then it might give them to you again if you ask about the same numbers. In that case you might need to buy 10 or 20 calculators that all have different gremlins in them, but otherwise the process is the same.

Of course if all your gremlins consistently lie for certain inputs, you might need to do a lot of work to sample all over your input space and see exactly what sorts of numbers they don't like. Then you can breed a new generation of gremlins that...
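A small sketch of why the consistent liar is the hard case. Both gremlins below are hypothetical stand-ins: one errs at random 1% of the time, the other always errs on the same inputs (here, arbitrarily, whenever the first operand is a multiple of 100).

```python
import random
from collections import Counter


def random_gremlin_add(a, b, rng, error_rate=0.01):
    """Errs independently on each call with probability error_rate."""
    correct = a + b
    return correct + 1 if rng.random() < error_rate else correct


def consistent_gremlin_add(a, b):
    """Always errs on the same inputs (hypothetical rule: a is a multiple of 100)."""
    correct = a + b
    return correct + 1 if a % 100 == 0 else correct


def majority_vote(answers):
    """Return the most common answer among the repeats."""
    return Counter(answers).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the run is repeatable
random_votes = [random_gremlin_add(200, 5, rng) for _ in range(21)]
consistent_votes = [consistent_gremlin_add(200, 5) for _ in range(21)]

print("random gremlin:", majority_vote(random_votes))
print("consistent gremlin:", majority_vote(consistent_votes))
```

Repeating the question 21 times rescues you from the random gremlin, but the consistent one returns the same wrong answer every time, so the vote is unanimous and wrong. That is the case where you need differently built gremlins or an independent check.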

Yea I know, can't really understand why people have such a problem with this. Just ignore the wrong answers and be thankful when it gives you a right answer. Picky bastards.
I believe it absolutely will. I think eventually we'll get to a point where people will be measured on how well they can get the AI to behave and how good they are at keeping costs down.

My boss built an AI workflow that cost over $600 that does the same thing I already gave him that cost less than $30. He just wanted to use tools he found and did it his way. Now, this had some value, it got more people in the company exposed to AI and he learned from the experience. It's his prerogative as he's the owner of the company. Though he also isn't concerned about the cost and will continue to pay much more. For now. I think as time goes on this will be more scrutinized.

This doesn't seem like the first time engineers try to work with something useful that is only partially reliable.

The solution is to play at its strengths and reinforce it with other mediums. You don't build structures with pure concrete. You add rebar. You don't build ships out of only sail and you don't build rail with just iron. You compose materials in a way that makes sense.

LLMs are most useful when the output is immediately verifiable. So let's build frameworks that take that to the core. Build everything around verification. And use LLMs for their strengths.
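One way "build everything around verification" might look in code is a generate-then-verify loop. This is only a sketch: `generate_until_verified`, `flaky_sort`, and `is_sorted_permutation` are hypothetical names, and `flaky_sort` is a stand-in for an unreliable model call, not a real LLM API.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def generate_until_verified(
    generate: Callable[[int], T],
    verify: Callable[[T], bool],
    max_attempts: int = 5,
) -> Optional[T]:
    """Return the first candidate that passes verify, or None if none does."""
    for attempt in range(max_attempts):
        candidate = generate(attempt)
        if verify(candidate):
            return candidate
    return None  # nothing verified: escalate to a human instead of guessing


data = [3, 1, 2]


def flaky_sort(attempt: int) -> list:
    """Stand-in for an unreliable generator: wrong on its first two attempts."""
    return list(data) if attempt < 2 else sorted(data)


def is_sorted_permutation(candidate: list) -> bool:
    """Cheap independent check: same elements as data, in nondecreasing order."""
    same_elements = sorted(candidate) == sorted(data)
    in_order = all(x <= y for x, y in zip(candidate, candidate[1:]))
    return same_elements and in_order


result = generate_until_verified(flaky_sort, is_sorted_permutation)
print(result)
```

The design choice is that the verifier never trusts the generator: an unverified answer is dropped, and running out of attempts returns `None` rather than the least-bad guess.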

>If a calculator works great 99% of the time you could not use that calculator to build a bridge.

That's happened before with far higher correctness rate than 99%, and it cost Intel $500M. Reliability and accuracy matter. https://en.wikipedia.org/wiki/Pentium_FDIV_bug

What we are seeing with our customers is that LLM errors are a very manageable problem. End users adapt pretty quickly to the idea that AI systems aren't perfect. In many cases AI products are doing tasks that used to be done by humans and these humans were making mistakes too, so the end user is used to the idea that the task will get accomplished with some non-zero error rate.

You just need to build your products in a manner where the user has the ability to easily double check the results whenever they like. Then they can audit as they see fit, in order to get used to the accuracy level and to apply additional scrutiny to cases that are very important to their business.

> the user has the ability to easily double check the results whenever they like

if the user is able to so easily verify that the results are accurate, that means that they are able to generate accurate results through other means, which means they don't need the LLM in the first place

I don't think that's necessarily true - many tasks are difficult to solve but easy to verify. If I ask "place names that end with um", or "good ideas for a birthday party" I can pretty much verify the answer just by reading it. In other cases, clicking through to check that a linked source supports a claim is easier than researching to find and summarize the source in the first place would be.
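A standard illustration of that asymmetry, not taken from this thread: finding a factor of a number takes a search, while checking a claimed factor pair takes one multiplication. The function names are made up for the sketch.

```python
from typing import Optional, Tuple


def verify_factor_pair(n: int, p: int, q: int) -> bool:
    """Checking a claimed answer: a couple of comparisons and one multiplication."""
    return 1 < p < n and 1 < q < n and p * q == n


def find_factor_pair(n: int) -> Optional[Tuple[int, int]]:
    """Producing an answer: trial division, up to about sqrt(n) steps."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return None  # no nontrivial factor pair exists


n = 10007 * 10009
pair = find_factor_pair(n)
print(pair, verify_factor_pair(n, *pair))
```

Whoever (or whatever) proposes the pair may have done a lot of work, but the person checking it does not have to repeat that work.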
> I can pretty much verify the answer just by reading it

Only if you have domain knowledge. In both of your examples, you have to 1) know geography to determine whether "Técolum" and "Tolum" are indeed city names or just made up; and 2) know what might be acceptable ("good idea") or not at a birthday party.

Yes, it'll probably save you some time, but it's not orders of magnitude.

> In other cases, clicking through to check that a linked source supports a claim

this supposes that the AI provides a link for every fact. Google search + Gemini does, but most LLM interfaces don't.

secondly, if I have to click through every link and read through the source to determine whether details of a "summary" are correct or not, that really does not save me much time from conducting a search and looking through the linked sources myself

Anecdote from a couple of weeks ago. My wife's professor sent her 5 citations and summaries related to a medical research project. She didn't say they were LLM generated, but it was obvious (to me, not my wife) they were, by the formatting alone. None of the 5 papers existed as cited. My wife was confused, spent a lot of time trying to figure out what was wrong and why she couldn't find any of the papers. A Google Scholar search turned up 2 papers that, with some logical thinking, were close enough to the citations to be the intended ones, but the other 3 were not even matchable. In the end, the time spent trying to sort out valid vs invalid citations, and find valid replacements, was significantly greater than just doing the search and looking through the abstracts.

PS: LLMs are fine for information that can be "fuzzy": suggest places to go on vacation in September, plan a birthday party, etc. But I wouldn't consider that to be a "revolutionary" advance.

> Only if you have domain knowledge [...]

It's common to have a reasonable intuitive sense for whether something works as a birthday party yet be stumped when coming up with ideas. Or be able to see that a word ends in "um" and is a real word/place you recognise (or double click -> search if not) without necessarily being able to list many yourself if asked. I don't mean to say that verification requires absolutely zero knowledge, just that it can be (and often is) substantially easier, so I don't think insane_dreamer's reasoning holds.

> this supposes that the AI provides a link for every fact.

For andrewmutz's LLM, it was the very statement "the user has the ability to easily double check the results whenever they like" that was suggested to make it unnecessary in the first place.

Outside of that case, people have the choice to use the LLM that best suits their task - and most popular ones I'm aware of do support search/RAG.

It's certainly possible to waste time by doing something like what your wife's professor seemingly did (get non-link "citations" generated by an LLM without search/RAG, then send them to someone who'll probably infer "these must exist somewhere because the sender read them" as opposed to "these were vaguely recalled from memory so may not exist") - I don't recommend doing that.

> secondly, if I have to click through every link and read through the source to determine whether details of a "summary" are correct or not, that really does not save me much time from conducting a search and looking through the linked sources myself

A lot of LLM responses are for the kind of thing that doesn't need verification, or for which verification doesn't depend on checking the source. For situations where checking the source is relevant, that's typically just going to be the source for the part you're interested in - in the same way a Wikipedia article can provide a useful lead without needing to check every source the article cites. Anecdotally I find that, while far from perfect, it saves a lot of time when it can surface information that would've otherwise required digging through a dozen or so sources.

(Later addendum: just noticed now that for some reason I had thought you were a different user to the initial comment I had replied to. The "so I don't think insane_dreamer's reasoning holds" was meant to refer to your previous comment)
True, but people are so enamored by what they can do that they rarely seem to think about that. We will overspend on AI purely because we think it's cool.
Yea I just think the true unlock in productivity will come from not requiring a human in the loop.
> If a calculator works great 99% of the time you could not use that calculator to build a bridge.

But if the alternative is doing calculations by hand (writing code manually) there is a higher chance of making mistakes.

Just as calculations are double-checked while building bridges, unit tests and code reviews should catch bugs introduced by LLM-written code.

Code review is your last (and worst) line of defense. Humans are not good at needle-in-a-haystack tasks.