by Animats 563 days ago
> While the hallucination problem in LLMs is inevitable

Oh, please. That's the same old computability argument used to claim that program verification is impossible.

Computability isn't the problem. LLMs are forced to a reply, regardless of the quality of the reply. If "Confidence level is too low for a reply" is an option, the argument in that paper becomes invalid.

The trouble is that we don't know how to get a confidence metric out of an LLM. This is the underlying problem behind hallucinations. As I've said before, if somebody doesn't crack that problem soon, the AI industry is overvalued.
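To make the "too low for a reply" option concrete, here is a minimal sketch. Everything in it is hypothetical: `generate` stands in for a model call that returns an answer together with a confidence score in [0, 1], and `fake_generate` uses made-up numbers. A trustworthy score like that is exactly the piece we don't know how to get, so this only shows how simple the gating would be once it exists.

```python
from typing import Callable, Tuple

ABSTAIN = "Confidence level is too low for a reply"


def answer_or_abstain(
    generate: Callable[[str], Tuple[str, float]],
    prompt: str,
    threshold: float = 0.8,
) -> str:
    """Return the answer only if its confidence clears the threshold."""
    answer, confidence = generate(prompt)
    return answer if confidence >= threshold else ABSTAIN


def fake_generate(prompt: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a model; the confidences are invented."""
    canned = {
        "2 + 2": ("4", 0.99),
        "Who won the 2031 World Cup?": ("Brazil", 0.30),
    }
    return canned.get(prompt, ("no idea", 0.0))


print(answer_or_abstain(fake_generate, "2 + 2"))
print(answer_or_abstain(fake_generate, "Who won the 2031 World Cup?"))
```

The gate itself is one comparison. The hard part is making `confidence` mean something.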

Alibaba's QwQ [1] supposedly is better at reporting when it doesn't know something. Comments on that?

This article is really an ad for Kapa, which seems to offer managed AI as a service, or something like that. They hang various checkers and accessories on an LLM to try to catch bogus outputs. That's a patch, not a fix.

[1] https://techcrunch.com/2024/11/27/alibaba-releases-an-open-c...

3 comments

Confidence levels aren't necessarily low for incorrect replies; that's the problem. The LLM doesn't "know" that what it's outputting is incorrect. It just knows that the words it's writing are probable given the inputs: "this is what answers tend to look like".

You can make improvements, as your parent comment already said, but it's not a problem that can be solved, only reduced to some degree.

> Computability isn't the problem. LLMs are forced to a reply, regardless of the quality of the reply. If "Confidence level is too low for a reply" is an option, the argument in that paper becomes invalid.

This is false. The confidence level of these models does not encode facts; it encodes the statistical probability that a particular word would come next in the training data set. One source of output that is not fit for purpose (i.e. hallucinations) is unfit information in the training data, and that problem is intractable given the size of the data required to train a base model.
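A toy illustration of that point, with invented data. Suppose the only "model" is a frequency count of what followed the prompt "The capital of Australia is" in a made-up corpus where the wrong answer happens to be more common. This is a deliberately crude stand-in for next-word prediction, not how a real LLM is built.

```python
from collections import Counter

# Invented toy corpus: words seen after "The capital of Australia is".
# The wrong continuation is simply more frequent than the right one.
continuations = ["Sydney"] * 70 + ["Canberra"] * 30


def next_word_probs(samples):
    """Estimate next-word probabilities from raw frequency counts."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


probs = next_word_probs(continuations)
best = max(probs, key=probs.get)
print(best, probs[best])
```

The most probable word is the wrong one, and its probability is high. Thresholding on that number would let the error through.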

You can reduce this problem by managing your training data better, but that's not possible to do perfectly, which gets to my point: managing hallucinations is entirely about risk management and reducing the probability of failure to an acceptable level. It's not decidable, only manageable, and only for applications whose stakes are low enough that a 99.9% (or whatever) success rate is acceptable. It's a quality control problem, and one that will always be a battle.

> Alibaba's QwQ [1] supposedly is better at reporting when it doesn't know something. Comments on that?

I've been trying it out, and what it's actually better at is going in circles indefinitely, giving the illusion of careful thought. This can possibly be useful, but it's just as likely to "hallucinate" reasons why its first (correct) response might have been wrong (reasons that make no sense) as it is to actually correct itself.

LLMs and their close buddies, NNs, use models that do massive amounts of what is effectively cubic spline fitting across N dimensions.

Cubic splines have the same issues as what these nets are seeing. There are two points and a 'line of truth' between them. But the formula that connects the dots, as it were, only guarantees that the two points are on the curve. You can tweak the curve to fit the line better, but the fit is not always 100%; in fact, it can vary quite wildly. That is the 'hallucination' people are seeing.

Now can you get that line of truth close by more training? Which basically amounts to tweaking the weights. Usually yes, but the method basically only guarantees the points are on the curve. Everything else? Well, it may or may not be close. Smear that across thousands of nodes and the error rate can add up quickly.
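Here is a small sketch of the spline side of that analogy. It says nothing about how a network actually computes. The choice of a cubic Hermite spline with Catmull-Rom tangents is mine, picked because it fits in a few lines, and the step-shaped data is invented. The 'line of truth' stays between 0 and 1, the curve hits every given point exactly, and it still strays outside that range in between.

```python
def catmull_rom(xs, ys, x):
    """Evaluate a cubic Hermite spline with Catmull-Rom tangents at x.

    xs must be strictly increasing and x must lie in [xs[0], xs[-1]].
    """
    n = len(xs)

    # Find the segment [xs[i], xs[i + 1]] that contains x.
    i = 0
    while i < n - 2 and x > xs[i + 1]:
        i += 1

    def slope(k):
        # Finite-difference tangent, one-sided at the two ends.
        lo, hi = max(k - 1, 0), min(k + 1, n - 1)
        return (ys[hi] - ys[lo]) / (xs[hi] - xs[lo])

    h = xs[i + 1] - xs[i]
    t = (x - xs[i]) / h

    # Standard cubic Hermite basis functions.
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2

    return (h00 * ys[i] + h10 * h * slope(i)
            + h01 * ys[i + 1] + h11 * h * slope(i + 1))


# Invented step-shaped data: the truth never leaves [0, 1].
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

at_knots = [catmull_rom(xs, ys, x) for x in xs]
below = catmull_rom(xs, ys, 1.5)  # dips under 0
above = catmull_rom(xs, ys, 3.5)  # rises over 1
print(at_knots, below, above)
```

The known points are reproduced exactly, while the values halfway between them land outside anything seen in the data.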

If we want a confidence level, my gut is saying that we would need to measure how far away from the inputs an output ended up being. The issue that creates, though, is that the inputs are now massive. Sampling can make the problem more tractable, but then that has more error in it. Another possibility is tracking how far the output landed from the 100% points (the known points the curve is guaranteed to pass through). Then a crude summation might be a good place to start.
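A rough sketch of the distance idea, under heavy assumptions: the "inputs" here are plain 2-D points on an invented grid, and "far from the training data" is taken to mean Euclidean distance to the nearest one. Neither is an established confidence metric for LLMs; the function names are made up for illustration.

```python
import math
import random


def nearest_distance(query, training_inputs):
    """Distance from the query to the closest known input."""
    return min(math.dist(query, p) for p in training_inputs)


def sampled_nearest_distance(query, training_inputs, k, seed=0):
    """Same idea on a random sample of k inputs: cheaper, less accurate."""
    rng = random.Random(seed)
    sample = rng.sample(training_inputs, k)
    return nearest_distance(query, sample)


# Invented "training inputs": a 10 x 10 grid of points.
training = [(float(i), float(j)) for i in range(10) for j in range(10)]

near = nearest_distance((2.1, 3.0), training)    # close to a known point
far = nearest_distance((50.0, 50.0), training)   # well outside the grid
approx = sampled_nearest_distance((2.1, 3.0), training, k=10)
print(near, far, approx)
```

The sampled version can only overestimate the true nearest distance, since it searches a subset. That is the extra error that sampling introduces.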