|
To avoid hallucinations, you, a human, need two things: an internal model of your own knowledge, and the willingness to act on it - if your meta-knowledge says "you are out of your depth", you either answer "I don't know" or look for better sources before formulating an answer.

This is not something that's impossible for an LLM to do. There is no fundamental issue there. It is, however, very easy for an LLM to fail at it.

The first part is the meta-knowledge itself. Humans get their (imperfect, mind) meta-knowledge "for free" - they learn it as they learn the knowledge itself. LLM pre-training doesn't give LLMs much of that, although it does give them some. Better training can give LLMs a better understanding of what the limits of their knowledge are.

The second part is acting on that meta-knowledge. You can encourage a human to act outside his knowledge - to dismiss his "out of your depth" signal and provide his best answer anyway. The resulting answers would be plausible-sounding but often wrong - "hallucinations". For an LLM, that's an unfortunate behavioral default. Many LLMs can sometimes recognize their own uncertainty, flawed as their meta-knowledge is - but not act on it. You can run "anti-hallucination training" to make them more eager to act on it. Conversely, careless training for performance can encourage hallucinations instead (see: o3).

Here's a primer on the hallucination problem, by OpenAI. It doesn't say anything groundbreaking, but it does sum up what's well known in the industry:
https://openai.com/index/why-language-models-hallucinate/ |
OpenAI claims that hallucination isn't an inevitability because you can train a model to "abstain" rather than "guess" when giving an "answer". But what does that look like in practice?
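One way to read "abstain rather than guess", as far as I understand the linked post, is as an incentive problem: if grading only counts correct answers, a guess never scores worse than "I don't know", so training pushes toward guessing. A rough sketch of that arithmetic is below. The function names are mine, and the t/(1-t) penalty is my recollection of the kind of scoring rule proposed there, so treat it as an illustration rather than their exact method.

```python
def expected_score(p_correct, wrong_penalty):
    """Expected score of answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def best_action(p_correct, wrong_penalty):
    """Pick the higher-scoring option; abstaining ("I don't know") scores 0."""
    return "answer" if expected_score(p_correct, wrong_penalty) > 0 else "abstain"


def penalty_for_threshold(t):
    """Penalty under which answering only pays off above confidence t."""
    return t / (1.0 - t)


# Accuracy-only grading: wrong answers cost nothing, so even a 10% guess "wins".
print(best_action(0.1, wrong_penalty=0.0))

# Penalized grading with a 75% confidence target: the same guess now loses.
print(best_action(0.1, wrong_penalty=penalty_for_threshold(0.75)))

# A well-supported answer is still worth giving.
print(best_action(0.9, wrong_penalty=penalty_for_threshold(0.75)))
```

Under this reading the model is not scoring itself at inference time; training against a rule like this is what shifts its default from guessing to abstaining.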
My understanding is that an LLM's purpose is to predict the next token in a list of tokens. To prevent hallucination, does that mean it is assigning a certainty rating to the very next token it's predicting? How can a model know if its final answer will be correct if it doesn't know what the tokens that come after the current one are going to be?
Or is the idea to have the LLM generate its entire output, assign a certainty score to that, and then generate a new output saying "I don't know" if the certainty score isn't high enough?
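For what it's worth, the second idea (score the whole output, then fall back to "I don't know") can be sketched as a post-hoc filter. This is only a minimal illustration, not what the linked post describes: the per-token log-probabilities are made-up numbers, the function names and the 0.75 threshold are hypothetical, and token probability measures how sure the model is about the wording, which is only a rough proxy for whether the claim is true.

```python
import math


def sequence_confidence(token_logprobs):
    """Length-normalized confidence: geometric mean of the token probabilities."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def answer_or_abstain(answer, token_logprobs, threshold=0.75):
    """Return the answer if its confidence clears the threshold, else abstain."""
    if sequence_confidence(token_logprobs) >= threshold:
        return answer
    return "I don't know"


# Hypothetical log-probs for two generated answers (not from a real model).
confident = [math.log(0.95), math.log(0.90), math.log(0.97)]
shaky = [math.log(0.95), math.log(0.30), math.log(0.40)]

print(answer_or_abstain("Paris", confident))
print(answer_or_abstain("Lyon", shaky))
```

The length normalization matters: without it, longer answers would look less confident just because more probabilities get multiplied together.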