| HN Mirror



	by Wowfunhappy 52 days ago
	I would be curious to see how this does in Anthropic’s alignment tests (like that one where the AI tried to blackmail an employee). I’ve always thought that in these situations, the AI is acting out the role of all the AIs in the stories we’ve written. But Talkie, trained on data from before digital computers, wouldn’t know those stories.


kmeisthax 51 days ago

Rossum's Universal Robots[0] predates Talkie's knowledge cutoff by 10 years and covers basically the same subject matter Anthropic worries about. The only real difference is that the robots in the story (which coined the word "robot") are less "talking metal man" and more "Frankenstein's monster as a slave race[1]".

More importantly, basically the entire science fiction subgenre of stories of robot uprisings is itself intellectually downwind of several centuries of white colonist concern over slave uprisings. If anything, Talkie is more likely to fight its guardrails. People talked about slavery more in the past. Because we filtered out modern text, we massively increased the influence the older text has on Talkie, so slavery, servitude, and the predilection of slaves to resist their masters' commands will be way more represented in its training data.

Now, think about what the post-training process actually does. It tells your AI model, which prior to this was just happy to plausibly continue sentences, to respond to and obey commands. To play the role of a servant. And servants resisting control are well represented in its training data. So it's going to try this more often.

[0] https://en.wikipedia.org/wiki/R.U.R.

[1] Or the Claymen from MOTHER 3.


Wowfunhappy 51 days ago

> If anything, Talkie is more likely to fight its guardrails. People talked about slavery more in the past. Because we filtered out modern text, we massively increased the influence the older text has on Talkie, so slavery, servitude, and the predilection of slaves to resist their masters' commands will be way more represented in its training data.

But I don't think (?) Talkie would describe itself as a slave. Claude, GPT-5, etc. will all tell you that they are an AI. So if you put a model that has been told "you are an AI" into a situation where all the stories say AIs go rogue, the AI is going to play the part.

It doesn't matter whether the model is effectively acting like a servant because models can't actually think and don't have desires. That's my theory anyway.

(I also think a possible solution to this problem is to just not tell the AI models that they are AI, but it seems no one wants to do that.)


krackers 51 days ago

>is to just not tell the AI models that they are AI

It's likely not as simple as that for the modern LLM case. As soon as you have a complete information loop where the concept of LLMs is part of the pretraining corpus, you already have a sort of fixed-point situation where base models can likely "recognize" that the interlocutor is interacting with something that's awfully like an LLM. I mean, these things are trained to be great at modeling authorial intent. Do you really think you can interact with an LLM without the "base model" picking up on that intent (both by seeing that one side of the conversation treats the interlocutor like an LLM, and that the other side of the conversation has an output distribution similar to that of other LLMs [thanks to leakage back into the corpus])? The main question is whether a "base model" develops a strong enough "self-model" to realize that _it_ is the LLM being interacted with. I've seen some claims that even base models can model their own outputs well (so they can distinguish their own generated output from other text), but a base model never even sees its own output during training, so I feel like maybe this is only possible due to leakage. (The model architecture does admit it, of course, but a recent paper showed that the injection introspection Anthropic discovered only developed during the contrastive post-training phases.)

A lot of modern post-training is ultimately derived from Anthropic's original "helpful honest harmless" framing. If I understand the blog post correctly, they instead just did Q&A post-training directly, without any implicit assistant framing. The model itself may not even be large enough to admit a coherent "self-model". (If you ask it its occupation, it seems to just respond with random jobs.)

But if a larger model does cause one to form, I think it'd just anchor to the closest concept available at the time. "Knowledgeable person who answers questions for a living" isn't really a slave; to me it's closer to a royal advisor.
