"Hallucination" isn't really a problem that can be "fixed". Its just model error.
The root problem is simply that the model doesn't capture reality, just an approximation. What we are incorrectly calling "hallucination" is just the best the model has to offer.
During pre-training, there is never an incentive for the model to say "I don't know" because it would be penalized. The model is incentivized to make an educated guess.
Large transformer models are really good at approximating their dataset. There is no data on the internet about what LLMs know. And even if there were such data, it would probably become obsolete soon.
That being said, maybe a big shift in the architecture could solve this. I hope!
Suppose there are many times more posts about something one generation of LLMs can't do (arithmetic, tic-tac-toe, whatever), than posts about how the next generation of models can do that task successfully. I think this is probably the case.
While I doubt it will happen, it would be somewhat funny if training on that text caused a future model to claim it can't do something that it "should" be able to because it internalized that it was an LLM and "LLMs can't do X."
In another paper that popped up recently, they approximated uncertainty with entropy and inserted "wait!" tokens whenever entropy was high, simulating chain of thought within the system.
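A rough sketch of that idea, not the paper's actual method: compute the Shannon entropy of the next-token distribution and emit a marker when it is high. The threshold, the `wait!` string, and the toy distributions are all made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def next_token_or_wait(dist, threshold=1.0, wait_token="wait!"):
    """Return the most probable token, or wait_token if the distribution is too flat.

    dist maps token -> probability and is assumed to sum to 1.
    threshold and wait_token are illustrative values, not taken from any paper.
    """
    if entropy(dist.values()) > threshold:
        return wait_token
    return max(dist, key=dist.get)

# Hypothetical distributions: one peaked, one nearly uniform.
confident = {"Paris": 0.97, "Lyon": 0.02, "Nice": 0.01}
uncertain = {"1889": 0.26, "1887": 0.25, "1890": 0.25, "1885": 0.24}

print(next_token_or_wait(confident))
print(next_token_or_wait(uncertain))
```

The peaked distribution has low entropy and passes through; the nearly uniform one is close to ln(4) nats and trips the marker.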
> during pre-training, there is never an incentive for the model to say "I don't know" because it would be penalized. the model is incentivized to make an educated guess
The guess can be "I don't know". The base LLM would generally only say "I don't know" if it "knew" that it didn't know, which is not going to be very common. The tuned LLM is the layer responsible for mapping a lack of understanding to saying "I don't know".
I'm led to believe this is mostly because "known unknowns" are not well-represented in the training datasets... I think, instead of bothering with refusals and enforcing a particular "voice" with excessive RL, they ought to focus more on identifying "gaps" in the datasets and feeding them back. Perhaps they're already doing this with synthetic data / distillation.
It can be fixed in theory if the model knows what it knows, so it can avoid saying things it's uncertain about (this is what (some) humans do to reduce the frequency with which they say untrue things).
There's some promising research using this idea, though I don't have it at hand.
LLMs can't hallucinate. They generate the next most likely token in a sequence. Whether that sequence matches any kind of objective truth is orthogonal to how models work.
I suppose depending on your point of view, LLMs either can't hallucinate, or that's all they can do.
>Whether that sequence matches any kind of objective truth is orthogonal to how models work.
Empirically, this cannot be true. If it were, it would be statistically shocking how often models coincidentally say true things. The training does not perfectly align the model with truth, but 'orthogonal' is off by a minimum of 45 degrees.
It matches the training data. Whether the training data matches truth (and whether it's correctly understood - sarcasm included) is a completely separate thing.
> The training does not perfectly align the model with truth, but 'orthogonal'
I went to school to learn about the world and the overwhelming majority of that learning was from professors and textbooks. Whether the professors' beliefs and the textbooks' contents reflected the true properties of the world was a completely separate thing, entirely outside of my control. But I did come away with a better understanding of the world and few would say that education is orthogonal to that goal.
If you add two vectors that don't have a truth component (i.e., are orthogonal to the truth), the resulting vector should be no closer to the truth. If you start with random weights and perform some operation on them such that the new weights have a higher likelihood of producing true statements, the operation must not have been orthogonal to the truth. Am I wrong there?
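Written out, taking the metaphor literally and assuming "truth" can be treated as a single direction t (which is only the framing of this argument, not a claim about real models): the dot product is linear, so any combination of vectors orthogonal to t stays orthogonal to t, and an update only increases alignment with t if it has a positive component along t.

```latex
\[
a \cdot t = 0,\quad b \cdot t = 0
\;\Longrightarrow\;
(\alpha a + \beta b)\cdot t = \alpha\,(a \cdot t) + \beta\,(b \cdot t) = 0 ,
\]
\[
(w + \Delta w)\cdot t > w \cdot t
\;\Longleftrightarrow\;
\Delta w \cdot t > 0 .
\]
```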
Whenever someone takes issue with using the word “hallucinate” with LLMs I get the impression they’re trying to convince me that hallucination is good.
Why do you care so much about this particular issue? And why can’t hallucination be something we can aim to improve?
I'm pretty sure there's something I don't understand, but:
Doesn't an LLM pick the "most probable next symbol" (or, depending on temperature, one of the most probable next symbols)? To do that, doesn't it have to have some idea of what the probability is? Couldn't it then, if the probability falls below some threshold, say "I don't know" instead of giving what it knows is a low-probability answer?
1) The model outputs a probability for every token in its vocabulary; the probabilities always sum to 1. Sometimes there is a clear "#1 candidate"; very often there are a number of plausible candidates. This is just how language works - there are multiple ways to phrase things, and you can't have the model give up every time there is a choice of synonyms.
2) Probability of a token is not the same as probability of a fact. Consider a language model that knows the approximate population of Paris (2 million) but is not confident about the exact figure. Feed such a model the string "The exact population of Paris is" and it will begin with "2" but halfway through the number it will have a more or less arbitrary choice of 10 digits. "2.1I don't know" is neither a desirable answer, nor a plausible one from the model's perspective.
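To make point 2 concrete, here is a toy sketch with invented per-token probabilities for the Paris example. It shows that the probability of the whole answer is the product of the per-token probabilities, and that a per-token threshold only trips after part of the number has already been emitted. All numbers and the 0.5 threshold are assumptions for illustration.

```python
import math

# Hypothetical per-token probabilities while completing
# "The exact population of Paris is" (numbers invented for illustration).
steps = [
    ("2", 0.90),  # leading digit: the model is fairly sure
    (",", 0.95),  # formatting token: near certain
    ("1", 0.30),  # next digit: some idea, but hedging
    ("4", 0.10),  # from here on: roughly uniform over ten digits
    ("8", 0.10),
]

def joint_probability(steps):
    """Probability of the whole sequence: product of the per-token probabilities."""
    return math.prod(p for _, p in steps)

def prefix_before_threshold(steps, threshold):
    """Text already emitted when the first token falls below the threshold."""
    emitted = []
    for token, p in steps:
        if p < threshold:
            break
        emitted.append(token)
    return "".join(emitted)

print(joint_probability(steps))
print(prefix_before_threshold(steps, threshold=0.5))
```

By the time the threshold fires, the output already starts with a confident-looking prefix, which is how you would end up with something like "2.1I don't know".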
My understanding is that the hallucination is, out of all the possibilities, the most probable one (ignoring temperature). So the hallucination is the most probable sequence of tokens at that point. The model may be able to predict an "I don't have that information" given the right context. But ensuring that in general is an open question.
> Doesn't an LLM pick the "most probable next symbol"
Yes, but that very rarely matters. (Almost never when it's brought up in discussions)
> Couldn't it then, if the probability falls below some threshold, say "I don't know" instead of giving what it knows is a low-probability answer?
A low probability doesn't necessarily mean something's incorrect. Responding to your question in French would also have very low probability, even if it's correct. There's also some nuance around what's classified as a hallucination... Maybe something in the training data did suggest that answer as correct.
There are ideas similar to this one though. It's just a bit more complex than pure probabilities going down. https://arxiv.org/abs/2405.19648
You need to separate out the LLM, which only produces a set of probabilities, from the system, which includes the LLM and the sampling methodology. Sampling is currently not very intelligent at all.
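A minimal sketch of that split, with a hard-coded distribution standing in for the LLM's output. The function names and the toy distribution are hypothetical; the point is only that the sampler is a separate, fairly simple piece of code that sees nothing but the probabilities.

```python
import random

def greedy(dist):
    """Sampler 1: always pick the most probable token."""
    return max(dist, key=dist.get)

def sample_with_temperature(dist, temperature=1.0, rng=random):
    """Sampler 2: reweight each probability as p ** (1 / T), then draw one token.

    For probabilities that came from a softmax, this matches dividing the
    logits by T before the softmax. Low T approaches greedy; high T flattens.
    """
    tokens = list(dist)
    weights = [dist[t] ** (1.0 / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Stand-in for the model's output at one step (invented numbers).
dist = {"blue": 0.5, "grey": 0.3, "green": 0.2}

print(greedy(dist))
print(sample_with_temperature(dist, temperature=1.0, rng=random.Random(0)))
```

Neither sampler knows anything about meaning or truth, which is roughly what "not very intelligent" amounts to here.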
The next bit of confusion is that the 'probability' isn't 'real'. It's not an actual probability but a set of weights that sum to one, which is close enough to how probability works that we call it that. However, sometimes there are several good answers, so each good answer gets a lower probability because there are 5 of them. A fixed threshold is not a good idea in this case. Instead, smarter sampling methods are necessary. One possibility: when there is apparent confusion, put a 'confusion marker' into the text, predict the next output, and train models to refine the answer as they go along. Not sure if any work has been done here, but this seems to go along with what you're interested in.
The results before softmax don't sum to one so don't even act like a probability distribution. And that's the point. When you have the pre-softmax activations, there are infinitely many ways to convert them to something probability-like. You can normalize them after taking the square root, the square, raising to three, etc. Or you can exponentiate and for some reason that does better. Either way it's not a 'real' probability distribution.
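A small sketch of two such conversions applied to the same made-up logits. Both outputs are non-negative and sum to 1, yet they disagree about the actual values. One thing that does set the exponential apart is that softmax is unchanged when a constant is added to every logit, while squaring is not (and squaring also ignores sign); whether that is the reason it does better is not something this sketch shows.

```python
import math

def softmax(logits):
    """Exponentiate, then normalize. Subtracting the max is for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def square_normalize(logits):
    """Square, then normalize. Also sums to 1, but treats -2.0 the same as +2.0."""
    squares = [x * x for x in logits]
    total = sum(squares)
    return [s / total for s in squares]

logits = [2.0, 1.0, 0.5]           # invented pre-softmax activations
shifted = [x + 10.0 for x in logits]

print(softmax(logits), square_normalize(logits))
print(softmax(shifted), square_normalize(shifted))
```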
This may work when the next token is a key concept. But when it's a filler word, or part of one of many word sequences that convey the same meaning in different ways (synonymy at the sentence level, not only at the word level), it's harder to know whether the probability is low because the word is genuinely unlikely or because its likelihood is spread across other truthful statements.
You would need some kind of referential facts that you hold as true, then some introspection method to align sentences to those. If it can’t be done, the output may be “I don’t know”. But even for programming languages (the simplest useful languages), it would be hard to do.
My guess is the problem is words with high probabilities that happen to be part of a wrong answer.
For one thing, the probability of a word occurring is just the probability of that word occurring in a certain sample; it's not an indicator of truth. (Truth is the most problematic concept in philosophy, in that just introducing it undermines the truth; see "9/11 truther".) It's also not sufficient to pick a "true" word, or to always pick a "true" word; rather, the truthfulness of a statement needs to be evaluated based on the statement as a whole.
A word might have a low probability because it competes with a large number of equally likely alternatives, which is not a reason to stop generation.
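A toy illustration of that last point, with an invented distribution and a hand-written table of which tokens mean the same thing (building such a table automatically is the hard part, and is simply assumed here). No single token clears a 0.5 threshold, yet almost all the probability mass sits on one meaning.

```python
# Hypothetical next-token distribution for a word describing something big.
dist = {"large": 0.24, "big": 0.23, "huge": 0.22, "sizable": 0.21, "tiny": 0.10}

# Hand-written meaning groups; assumed for illustration only.
groups = {
    "big": ["large", "big", "huge", "sizable"],
    "small": ["tiny"],
}

def mass_by_meaning(dist, groups):
    """Sum token probabilities within each meaning group."""
    return {name: sum(dist.get(tok, 0.0) for tok in toks)
            for name, toks in groups.items()}

best_token_prob = max(dist.values())
meaning_mass = mass_by_meaning(dist, groups)

print(best_token_prob)   # low: would fail a per-token threshold of 0.5
print(meaning_mass)      # but the "big" meaning holds most of the mass
```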
This reminds me that it's easy to train similarity models, but hard to train identity/equivalence prediction. Two strings can be similar in many ways, like "Address Line 1" and "Address Line 2" or "Position_X" and "Position_Y", yet distinct in meaning. That one character makes all the difference. On the other hand, "Vendor Name" is equivalent to "Seller Company" even though they are pretty different lexically.
The dot product, which is at the core of attention, is good for similarity, not identity. I think this is why models hallucinate: how can they tell the difference between "I have trained on this fact" and "Looks like something I trained on"?
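The similarity-versus-identity gap can be seen even with a crude stand-in for learned embeddings: character-trigram count vectors compared by cosine similarity (a normalized dot product). This is only a lexical sketch, not how attention or any real embedding model scores these strings, and a trained model might well rate the second pair closer.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams of the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between the n-gram count vectors of two strings."""
    va, vb = char_ngrams(a), char_ngrams(b)
    dot = sum(va[g] * vb[g] for g in va)  # Counter returns 0 for missing keys
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Nearly identical strings that mean different things.
print(cosine("Address Line 1", "Address Line 2"))
# Lexically unrelated strings that mean the same thing.
print(cosine("Vendor Name", "Seller Company"))
```

The first pair scores high despite being distinct fields; the second scores near zero despite being equivalent.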