by krackers
51 days ago
> is to just not tell the AI models that they are AI

It's likely not as simple as that for the modern LLM case. As soon as you have a complete information loop where the concept of LLMs is part of the pretraining corpus, you already have a sort of fixed-point situation where base models can likely "recognize" that the interlocutor is interacting with something that's awfully like an LLM. I mean, these things are trained to be great at modeling authorial intent. Do you really think you can interact with an LLM without the "base model" picking up on that intent (both by seeing that one side of the conversation treats the interlocutor like an LLM, and that the other side of the conversation has an output distribution similar to that of other LLMs [thanks to leakage back into the corpus])?

The main question is whether a "base model" develops a strong enough "self-model" to realize that _it_ is the LLM being interacted with. I've seen some claims that even base models can model their own outputs well (so they can distinguish their own generated output from other text), but a base model never even sees its own output during training, so I feel like maybe this is only possible due to leakage. (The model architecture does admit it, of course, but a recent paper showed that the injection introspection Anthropic discovered only developed during the contrastive post-training phases.)

A lot of modern post-training is ultimately derived from Anthropic's original "helpful honest harmless" framing. If I understand the blog post correctly, they instead just directly did Q&A post-training without any implicit assistant framing. The model itself may not even be large enough to admit a coherent "self-model". (If you ask it its occupation, it seems to just respond with random jobs.) But if a larger model does cause one to form, I think it'd just anchor to the closest concept available at the time.

"Knowledgeable person who answers questions for a living" isn't really a slave; to me it's maybe a royal advisor.