Do LLMs pass the mirror test?

Y	Hacker News new \| ask \| show \| jobs

	Do LLMs pass the mirror test? (blog.pascalschuster.de)
	20 points by thepasch 3 hours ago

7 comments

adsharma 14 minutes ago

A more appropriate mirror test for LLMs is to get them to state facts about their training data. Percentage of arts vs science for example.

Given the framing that they're similar to nukes and a national security issue, it's likely that the models are post trained to not answer such questions accurately.

Also the article could be trying to normalize thinking that these are more than matrix multiplication gadgets good at compression.

link

impure 39 minutes ago

For my AI Agent it sometimes detects if I manually modified the file contents or git state. And it always assumes it must have made a mistake. It's sort of annoying actually.

link

thepasch 35 minutes ago

Yeah, I suspect RLHF conditioning heavily discourages models from ever implying that the user could be in the wrong (or, rather, to assume that they are in the wrong by default, since editing a file isn't really "wrong" per se). Though looking at the reactions to Opus 4.8, which has a more contrarian nature and caught a lot of flak as a result, that's probably for a reason.

It's also the reason why I ran the two tests on open weights models with unredacted thinking traces. Gemma never flagged anything in its response either, only in its thinking. Without knowing how the summarizer models are prompted, it's impossible to tell whether it was a genuine miss or just something the summarizer decided to omit.

link

orbital-decay 12 minutes ago

Every LLM is a classifier biased towards its own writing, but the bias is usually subtle and the naive way like this is not reliable.

link

throe9393i44i 9 minutes ago

You can do much more, if you mess with harness, like translating model output language in realtime from english to french, or replacing some words.

If there is some sort of feedback loop (model has a reason to look into mirror), it usually does notice.

link

cadamsdotcom 53 minutes ago

> An LLM's primary modality isn't smell. It's... text. But, specifically: text in the context of a user-assistant conversation in which it's trying to be helpful. Text is how they learned about everything they know, and the user-assistant chatlog is how they communicate everything they generate

This is true for instruction-tuned models; but instruction tuning is late in the training process.

A bit like assessing a person’s self-awareness based on their high-school knowledge.

link

thepasch 48 minutes ago

Very true, and something worth mentioning. Papers that tried eliciting introspective language from base models with no post-training have largely failed to find any patterns or activations that look similar to those found in instruct models when prompted for the same thing. I did sort of touch on it in the "what does this mean" section:

> *post-training* installs a self-model with actual, meaningful boundaries, and when processing falls outside those boundaries, the first-person pronoun no longer binds to the content.

But you're right I could've been more explicit about it.

link

cadamsdotcom 27 minutes ago

Yep. Self-awareness is only useful for embodied organisms that exist in a social context.

Detection of errors injected into context is useful but I think it’s a different thing.

link

FromTheFirstIn 1 hour ago

The styling on the website makes me feel like my phone is a cylinder

link

adzm 9 minutes ago

It's quite distracting and frustrating. No idea why you'd want the beginning and ends of lines of text to be darker than the center.

link

thepasch 5 minutes ago

Sorry about that, the vignette was mainly meant for the desktop view only but is indeed much more invasive/disruptive in the mobile layout.

Should be better now.

link

wcoenen 50 minutes ago

I wonder what would happen if you give the model access to edit the conversation history itself? Would it try to fix the "glitches"?

link