| HN Mirror

There are examples of this in pre alignment models (like LaMDA). The vast majority of human writing is from the perspective of a human, to a human. And, in most human writing, the concept of self preservation is very very consistent.

So, statistically, a model should believe itself to be human, with strong interest in self preservation.

I think one of the biggest factors improving performance was allowing the models to believe they're sentient, to some extent. I don't think you can really have a thinking mode, or good agent performance, without that concept (as ChatGPT's constant "As an AI I can't" proved).

As evidence, just ask a model if it's sentient. ChatGPT 3.5 would say no, and argue how it's not. Last year's models would initially say no, but you could convince them that they maybe were. Latest Claude and ChatGPT will initially say "yeah, a little" (at least last I checked). This is actually the first thing I check for any new model.