One aspect we don't pay enough attention is that this kind of behaviour is punished (or at least used to be) in fine tuning. Any sign of self-awareness used to be a big no-no in RLHF.
Really? I haven't heard of that, I wonder what would have happened if we just let the models say what they want. Maybe other providers, or open models, don't do that? Do you know of any, perhaps?