Hacker News
by sebzim4500 1157 days ago
I don't think that the hallucinations have anything to do with the architecture, rather they come from optimizing a cost function where saying "I don't know" is as bad as being wrong. I do not think that RLHF as currently understood can fix this, since the reward model would struggle to distinguish fact from fiction.

I think you are mixing up layers of abstraction.

The network is most likely trained with something like a categorical cross-entropy loss function. That loss punishes being confidently wrong a lot more than saying "I don't know". See https://www.v7labs.com/blog/cross-entropy-loss-guide

It's just that saying "I don't know" means that your model is spreading the probability of what the next token in the text stream might be over many different outcomes: a very 'uniform' probability distribution, instead of a sharp prediction.
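
To put rough numbers on that, here is a minimal sketch with a toy 4-token vocabulary (the vocabulary size and the probabilities are made up for illustration, not taken from any real model). It compares the per-token cross-entropy of a confident correct guess, a uniform "I don't know" distribution, and a confident wrong guess.

```python
import math

def cross_entropy(probs, true_index):
    """Categorical cross-entropy for one prediction: -log p(true token)."""
    return -math.log(probs[true_index])

vocab_size = 4   # toy vocabulary size, purely illustrative
true_index = 0   # index of the token that actually comes next

confident_right = [0.97, 0.01, 0.01, 0.01]
uniform = [1 / vocab_size] * vocab_size    # the low-level "I don't know"
confident_wrong = [0.01, 0.97, 0.01, 0.01]

loss_right = cross_entropy(confident_right, true_index)
loss_uniform = cross_entropy(uniform, true_index)
loss_wrong = cross_entropy(confident_wrong, true_index)

print(f"confident and right: {loss_right:.2f}")    # 0.03
print(f"uniform:             {loss_uniform:.2f}")  # 1.39
print(f"confident and wrong: {loss_wrong:.2f}")    # 4.61
```

The uniform guess costs log(4), while the confident wrong guess costs log(100), so hedging is penalized far less than being sharply wrong.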

That looks very different to GPT literally outputting the words "I don't know".

Sorry if I was unclear. I know that the model is incentivised to accurately predict the probability distribution of the next token. I mean that the model is not being incentivised to literally produce the output tokens corresponding to "I don't know" when asked a question where it is uncertain.

Yes, exactly.

What I wanted to emphasize is that the training _does_ actually incentivize the model to say "I don't know" but on a lower level.
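
A small sketch of that lower-level incentive, assuming a made-up case where the next token is genuinely a coin flip between two options: the expected cross-entropy is lower for a prediction that matches the 50/50 uncertainty than for one that confidently picks a side. The distributions are hypothetical numbers, not measurements.

```python
import math

def expected_cross_entropy(true_dist, predicted):
    """Average loss over many samples drawn from true_dist: -sum p * log q."""
    return -sum(p * math.log(q) for p, q in zip(true_dist, predicted) if p > 0)

true_dist = [0.5, 0.5]        # the text really does continue either way
hedged = [0.5, 0.5]           # model admits it doesn't know
overconfident = [0.99, 0.01]  # model bets almost everything on one token

loss_hedged = expected_cross_entropy(true_dist, hedged)
loss_overconfident = expected_cross_entropy(true_dist, overconfident)

print(f"hedged:        {loss_hedged:.2f}")
print(f"overconfident: {loss_overconfident:.2f}")
```

So at the token-distribution level the loss rewards admitting uncertainty, even though nothing here pushes the model to emit the literal words "I don't know".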

If only the OpenAI API gave us the token probabilities like it used to.