Hacker News
by jeffbee 848 days ago
How do people get comfortable assuming that these chatbots have not hallucinated? I do not have access to the most advanced Gemini model, but using the one I do have access to, I fed it a 110-page PDF of a campaign finance report and asked it to identify the 5 largest donors to the candidate committee. This is basically a task I probably could have done with a normal machine vision/OCR approach, but I wanted to have a little fun. Gemini produced a nice little table with names on the left and aggregate sums on the right, where it had simply invented all of the cells. None of the names were anywhere in the PDF, and all the numbers were made up. So what signals do people look for indicating that any level of success has been achieved? How does anyone take a large result at face value if they can't individually verify every aspect of it?
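One cheap signal for this kind of extraction task is a grounding check: confirm that every name in the model's table appears somewhere in the source text. The sketch below is only an illustration. It assumes the PDF text has already been extracted by some other tool, and the function names and sample data are hypothetical. It only catches invented names; aggregate sums would still have to be recomputed from the itemized entries, since a total need not appear verbatim in the report.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks in extracted text do not cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_names(names, source_text):
    """Return the names that never appear in the source text (possible hallucinations)."""
    haystack = normalize(source_text)
    missing = []
    for name in names:
        needle = normalize(name)
        # An empty name cannot be verified, so treat it as ungrounded too.
        if not needle or needle not in haystack:
            missing.append(name)
    return missing


# Hypothetical sample data, not taken from any real report.
source_text = "Schedule A\nJane Example   $500.00\nJohn\nSample $250.00"
model_names = ["Jane Example", "John Sample", "Alex Invented"]

suspect = ungrounded_names(model_names, source_text)
print(suspect)
```

A name that passes this check is not necessarily correct (it may be attached to the wrong amount), but a name that fails it is a strong hint that the table was invented.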
4 comments

I'm not sure why you are being downvoted, but this is the same problem I encounter as soon as I try to do anything serious.

In the time it takes to devise, usually through trial and error, a prompt that elicits the response I need, I could've just done the work myself in nearly every scenario I've come across. Sometimes there are quick wins, sure, but it's mostly quick wrongs.

I’m with you. Any time I contribute to a GenAI project at work, I make it a point to ensure the LLM’s output is reviewed by an SME, always. LLMs are great at augmenting human experts, because the expert provides the verification.

There was a request to use LLMs for summarization; my first question was about the acceptable error tolerance. Was it 1 in a million? Six Sigma?
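To put numbers on that question, here is a rough sketch of how many reviewed, error-free summaries it would take to support a given tolerance. It assumes each summary is an independent trial, that the reviewer found zero errors, and that "Six Sigma" means the conventional 3.4 defects per million opportunities. The function names are made up for illustration.

```python
import math


def upper_bound_error_rate(n_reviewed: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the true error rate after n_reviewed samples with zero errors."""
    if n_reviewed <= 0:
        raise ValueError("n_reviewed must be positive")
    # Solve (1 - p) ** n = 1 - confidence for p; expm1 keeps precision for tiny p.
    return -math.expm1(math.log(1.0 - confidence) / n_reviewed)


def samples_needed(target_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of error-free samples whose upper bound is at or below target_rate."""
    if not 0.0 < target_rate < 1.0:
        raise ValueError("target_rate must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-target_rate))


one_in_a_million = samples_needed(1e-6)
six_sigma = samples_needed(3.4e-6)  # assumed: 3.4 defects per million opportunities
print(one_in_a_million, six_sigma)
```

Under these assumptions, a 1-in-a-million tolerance needs on the order of three million SME-checked summaries with no errors at all, and the Six Sigma figure needs most of a million, which is why the tolerance has to be agreed on before the project starts.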

Because it’s easy and people love easy.

The other night I was coding with ChatGPT, and it was hallucinating methods, etc. I was so happy that it had actually written the code; even though I knew it was wrong and potentially even dangerous, it looked good. I had actually told myself I'd never be someone to do this.

Now it wasn't ultra critical stuff I was working on, but it would've caused a mess if it didn't work out.

I ran it against a production system because I was lazy and tired and wanted to just get the job done. In the end I spent way more time fixing its ultra-wrong yet convincing-looking code, and I didn’t get to bed till 1am.

This will become more commonplace.

I use compiled languages. Nearly all of the time, finding out that an LLM hallucinated a method just means hitting "rebuild" and waiting a few seconds.