| HN Mirror

There's a slight exception to this, in that LLMs are able to accurately describe portions of the buffer that are arbitrarily hidden from the user.

An LLM that says "I said orcs are green because I recalled a scene in lord of the rings..." is fabricating*. An LLM that says "I talked about white genocide because my system prompt told me to" is very likely telling the truth because it can literally see the system prompt as it generates the output. Even though in the situation I'm referring to the system prompt was hidden from users. It's a logical conclusion from the combination of the system prompt and its previous output that that is why its previous output is what it is (that anyone could make with the same degree of confidence if they had access to the full buffer).

* Unless it's reading back from a <thinking> section of the buffer that was potentially hidden from the user.