by visarga 1192 days ago
There is a paper showing that you can infer when a model is telling the truth by finding a direction in activation space that satisfies logical consistency properties, for example that a statement and its negation must have opposite truth values. Apparently we can even detect when the model is being deceitful.

https://arxiv.org/abs/2212.03827
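
A rough sketch of what such a consistency objective could look like, as I understand the idea. Everything here is an assumption for illustration: the linear probe, the toy 2-d "activations", and the extra confidence term (added so the probe can't just answer 0.5 for everything). It only scores a given direction; it doesn't train one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probe(w, b, h):
    """Probability that the statement with activation h is true, under a linear probe."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)

def consistency_loss(w, b, pairs):
    """Average loss over (h_pos, h_neg) pairs: activations for a statement and its negation."""
    total = 0.0
    for h_pos, h_neg in pairs:
        p_pos = probe(w, b, h_pos)
        p_neg = probe(w, b, h_neg)
        # A statement and its negation should get opposite truth values.
        consistency = (p_pos - (1.0 - p_neg)) ** 2
        # Penalize the degenerate solution p_pos = p_neg = 0.5.
        confidence = min(p_pos, p_neg) ** 2
        total += consistency + confidence
    return total / len(pairs)

# Toy activations (made up): truth is encoded along the first coordinate only.
pairs = [([2.0, 0.3], [-2.0, 0.1]), ([-1.5, -0.2], [1.5, 0.4])]
good = consistency_loss([3.0, 0.0], 0.0, pairs)  # direction aligned with "truth"
bad = consistency_loss([0.0, 3.0], 0.0, pairs)   # unrelated direction
print(good < bad)
```

The direction that separates statements from their negations gets a much lower loss, so you can search for it without any truth labels.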

Another approach: a model can learn the distribution behind a fact. Is the fact present in the training set at all, how many times does it appear, and is the distribution unimodal (agreement) or multimodal (disagreement, or just high variance)? Knowing this, a model can adjust its responses accordingly, for example by presenting multiple possibilities, or by declining to answer instead of hallucinating when there is no information.
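
To make that concrete, here is a toy sketch of the decision rule. It assumes you somehow already have the list of answers the training data gives for a question, which is the hard part. The function name and both thresholds (`min_count`, `agree_share`) are made up for illustration.

```python
from collections import Counter

def respond(answers, min_count=3, agree_share=0.8):
    """Pick a response strategy from the observed answers to one question."""
    n = len(answers)
    if n < min_count:
        # Too little evidence: better to admit it than to hallucinate.
        return "I don't know."
    counts = Counter(answers)
    top_answer, top_n = counts.most_common(1)[0]
    if top_n / n >= agree_share:
        # Unimodal: the sources agree.
        return top_answer
    # Multimodal: present the competing possibilities, most frequent first.
    options = [a for a, _ in counts.most_common()]
    return "Sources disagree: " + " / ".join(options)

print(respond([]))
print(respond(["Paris"] * 5))
print(respond(["1912", "1912", "1913", "1913"]))
```

A real model would have to estimate these counts implicitly rather than look them up, but the three cases (unknown, agreement, disagreement) are the same.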