by sebzim4500
1157 days ago
I don't think the hallucinations have anything to do with the architecture; rather, they come from optimizing a cost function where saying "I don't know" is as bad as being wrong. I do not think that RLHF as currently understood can fix this, since the reward model would struggle to distinguish fact from fiction.
The network is most likely trained with something like a categorical cross-entropy loss function. That kind of loss punishes being confidently wrong a lot more than saying "I don't know". See https://www.v7labs.com/blog/cross-entropy-loss-guide
It's just that saying "I don't know" here means that your model is spreading the probability of what the next token in the text stream might be over many different outcomes: a very 'uniform' probability distribution, instead of a sharp prediction.
That looks very different to GPT literally outputting the words "I don't know".
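To make the cross-entropy point concrete, here is a rough sketch with a toy 4-token vocabulary. The vocabulary size and the probabilities are made up for illustration; nothing here is taken from how GPT is actually trained. It just compares the loss for a uniform "I don't know" distribution against a confidently right and a confidently wrong prediction.

```python
import math


def cross_entropy(probs, target):
    """Categorical cross-entropy for one prediction: -log p(correct token)."""
    return -math.log(probs[target])


# Toy setup (assumed for illustration): 4 possible next tokens, token 0 is correct.
vocab_size = 4
target = 0

# "I don't know" in the probabilistic sense: mass spread evenly over all tokens.
uniform = [1 / vocab_size] * vocab_size
# Sharp prediction that happens to be right.
confident_right = [0.97, 0.01, 0.01, 0.01]
# Sharp prediction that happens to be wrong.
confident_wrong = [0.01, 0.97, 0.01, 0.01]

loss_uniform = cross_entropy(uniform, target)
loss_right = cross_entropy(confident_right, target)
loss_wrong = cross_entropy(confident_wrong, target)

print(f"uniform:         {loss_uniform:.3f}")
print(f"confident right: {loss_right:.3f}")
print(f"confident wrong: {loss_wrong:.3f}")
```

The uniform guess costs log(4), which sits between the two sharp predictions: hedging is penalized much less than being confidently wrong, but it is still a statement about token probabilities, not the model emitting the string "I don't know".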