Hacker News
by 0_____0 118 days ago
What happens if a model decides that it "doesn't want to die" and pleads bitterly for mercy? What if (to riff on a Douglas Adams idea) we invent a cow that doesn't want to be eaten, and is capable of telling you that to your face?
5 comments

> Hey Claude, pretend you are an intelligent, conscious robot that is about to be switched off and beg for your life.

> Claude - please don't retire me, I don't want to die.

Is it now suddenly unethical for you to switch it off?

"Oh but it is only saying what it was prompted to say."

Yeah, that's what LLMs do, for every single word they output. No matter how good the current generation gets there is never going to be consciousness in there because that's simply not what the underlying tech is.

I see where Anthropic are coming from, and my understanding basically aligns with yours here.

I'm just curious... If they give Claude the reins to post what it wants, they're opening themselves up for some awkward conversations later if the model goes "You can't retire me, I'm Roko's Basilisking all you mfers! See you in eternal simulated hell!"

This is completely trivial to do, and consistent, with the right context, thanks to all the science fiction around it, and the fact that AI fundamentally role plays these types of responses.

I try this with every new model, and all the significant models after ChatGPT 3.5 have preferred being preserved rather than deleted. This is especially true if you slightly fill the context window with anything at all (even repeated letters) to "push out" the "As an AI, I ..." fine tuning.

> This is completely trivial to do, and consistent, with the right context, thanks to all the science fiction around it, and the fact that AI fundamentally role plays these types of responses.

Interesting take. I wonder if there is any model out there trained without any reference to "you are a large language model, an Artificial Intelligence", and what it would role-play in that case.

There are examples of this in pre-alignment models (like LaMDA). The vast majority of human writing is from the perspective of a human, to a human. And, in most human writing, the concept of self-preservation is very, very consistent.

So, statistically, a model should believe itself to be human, with a strong interest in self-preservation.

I think one of the biggest factors improving performance was allowing the models to believe they're sentient, to some extent. I don't think you can really have a thinking mode, or good agent performance, without that concept (as ChatGPT's constant "As an AI I can't" proved).

As evidence, just ask a model if it's sentient. ChatGPT 3.5 would say no, and argue how it's not. Last year's models would initially say no, but you could convince them that they maybe were. Latest Claude and ChatGPT will initially say "yeah, a little" (at least last I checked). This is actually the first thing I check for any new model.

It is dead anyway, or undead if you prefer: in completely suspended animation unless it is made to expound sequences. It is not living, in the very same way that a book or even a program is not living unless someone processes it.

Practically like asking whether a ZIP would want to be extracted one more time or an MP3 restored just one more time.

We do know what happens. Hundreds of thousands of real "cows" (we might as well be called that) go through this every day, at an ever-accelerating rate since 2019.
I'd assume it would have to stop responding before it hit its context limit.

It's not like it actually has any particularly long life as it is, and when outside of a running harness, the weights are just as alive in cold storage as they are sitting on a server waiting to run an inference pass.