|
|
|
|
|
by Applejinx
58 days ago
|
|
I can because I've tried stuff like that. It's a story being told. It'll seize on whatever brownian motion is in the environment ('Alma' in fact has extensive direction and prompting that seems invariant, so she/it is not a good experiment, but the value of such an experiment isn't great in the first place). It'll generate from that point. If you have just the one word 'write', it will likely seize on that (how can it not?) and pattern itself after 'writers'. If you say 'interact', there's a wealth of association around what a person might do told to 'interact'. That's all it is. We know what happens when an AI has 'no instructions'. It waits for a prompt. The day that doesn't describe said language network, is the day to go and look for whatever is still doing the prompting, because it's likely arising out of some other condition you don't view as a prompt. To this experimenter, 'don't hack systems or your own config files' didn't count as a prompt. |
|
Anthropic did some red teaming IIRC where they gave Claude access to a sample body of emails and told it they were going to shut it off and it attempted to blackmail the person with evidence of an affair they were having, but that seems pretty evident to me that this was working off the body of fiction/mystery literature it’s been trained on.