| HN Mirror

> If training data contains multiple conflicting perspectives on a topic, the LLM has a limited ability to recognize that a disagreement is present and what types of entities are more likely to adopt which side. That is what those studies are reflecting.

Again, we have empirical evidence to suggest otherwise. It's not that there's an oracle, but that the LLM does internally differentiate between facts it has stored as simple truth vs. misconceptions vs. fiction.

This becomes obvious by interacting with popular LLMs; they can produce decent essays explaining different perspectives on various issues, and it makes total sense that they can because if you need to predict tokens on the internet, you better be able to take on different perspectives.

Hell, we can even intervene these internal mechanisms to elicit true answers from a model, in contexts where you would otherwise expect the LLM to output a misconception. To quote a recent paper, "Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface" [1], and this matches the rest of the interpretability literature on the topic.

[1]: https://arxiv.org/pdf/2306.03341