by tripper_27
594 days ago
> If we define hallucinations as falsehoods introduced between the training data and LLM output,
Yes, if. Or we could realize that an LLM's output is a random draw from a distribution learned from the training data, i.e. ALL of its outputs are hallucinations. It has no concept of truth or falsehood.
However, we do know that LLMs possess viable internal models, as I linked to in the post you are responding to. The OP paper notes that its probes find the strongest signal of truth in the middle layers of the model, at the activations of the "exact answer" tokens, where truth is defined as whatever the correct answer on each benchmark is. That is, there is something inside the LLM that statistically correlates with whether its output matches "benchmark truth". Assuming you are willing to grant that "concept" and "internal model" are pretty much the same thing, this sure sounds like a concept of "benchmark truth" at work. If you aren't willing to grant that, I have no idea what you mean by concept.
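For anyone unfamiliar with probing, here is a rough sketch of what "a probe finds a signal of truth in a layer" means mechanically. To be clear about what is assumed: the activations are synthetic random vectors, not real LLM hidden states; the per-layer signal strengths in `layer_signal` are numbers I hard-coded so the middle layer wins by construction; and the difference-of-means classifier is a simple stand-in, not necessarily the kind of probe the paper uses. All function names are made up for illustration. It shows the method only and says nothing about any actual model.

```python
import random


def make_activations(n, dim, signal, rng):
    """Synthetic stand-in for hidden states at the answer token.

    Label 1 means "output matched the benchmark answer". The label shifts
    coordinate 0 by +signal or -signal; every other coordinate is pure noise.
    """
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal if label == 1 else -signal
        data.append((vec, label))
    return data


def fit_mean_difference_probe(train):
    """Fit a linear probe: direction = mean(class 1) - mean(class 0)."""
    dim = len(train[0][0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for vec, label in train:
        counts[label] += 1
        for i, x in enumerate(vec):
            sums[label][i] += x
    mu0 = [s / counts[0] for s in sums[0]]
    mu1 = [s / counts[1] for s in sums[1]]
    weights = [a - b for a, b in zip(mu1, mu0)]
    # Put the decision boundary halfway between the two class means.
    midpoint = [(a + b) / 2 for a, b in zip(mu1, mu0)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def probe_accuracy(probe, test):
    """Fraction of held-out examples the probe classifies correctly."""
    weights, bias = probe
    correct = 0
    for vec, label in test:
        score = sum(w * x for w, x in zip(weights, vec)) + bias
        correct += int((score > 0) == (label == 1))
    return correct / len(test)


def run_demo(seed=0):
    rng = random.Random(seed)
    # ASSUMPTION: invented signal strengths, chosen so "middle" is strongest.
    layer_signal = {"early": 0.1, "middle": 1.5, "late": 0.5}
    results = {}
    for name, signal in layer_signal.items():
        data = make_activations(2000, 8, signal, rng)
        train, test = data[:1000], data[1000:]
        results[name] = probe_accuracy(fit_mean_difference_probe(train), test)
    return results


if __name__ == "__main__":
    for name, acc in run_demo().items():
        print(f"{name}: held-out probe accuracy {acc:.2f}")
```

The point is that the probe is fit on one set of activations and scored on another, so a high held-out accuracy means the layer carries information that correlates with correctness. That is the sense of "statistically correlates" above.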
If you mean to say that humans have some model of Objective Truth which is inherently superior, I'd argue that isn't really the case. Human philosophers have been arguing for centuries over how to define truth, and don't seem to have come to any conclusion on the matter. In practice, people have wildly diverging definitions of truth, which depend on things like how religious or skeptical they are, what the standards for truth are in their culture, and various specific quirks from their own personality and life experience.
This paper only measured "benchmark truth" because that is easy to measure, but it seems reasonable to assume that other models of truth exist within LLMs as well. Given that LLMs are supposed to replicate the words that humans wrote, I suspect that their internal models of truth work out to be some agglomeration (plus some noise) of what various humans think of as truth.