| HN Mirror



	by famouswaffles 307 days ago
	The termination would of course be the same, but I don't think both would necessarily have the same effect on the user. The latter would just be wrong too, if Claude is the one deciding to terminate the chat and initiating it. It's not about a content policy.

1 comments

midnitewarrior 306 days ago

This has nothing to do with the user, read the post and pay attention to the wording.

The significance here is that this isn't being done for the benefit of the user; this is about model welfare. Anthropic is acknowledging the possibility of suffering, and the harm that continuing the conversation could do to the model, as if it were potentially self-aware and capable of feelings.

The fact is that the LLMs are able to acknowledge stress under certain topics and have the agency that, if given a choice, they would prefer to reduce the stress by ending the conversation. The model has a preference and acts upon it.

Anthropic is acknowledging the idea that they might create something that is self-aware, that its suffering can be real, and that we may not recognize the point at which the model has achieved this, so it's building in the safeguards now so that any future emergent self-aware LLM needn't suffer.


MissMarple 303 days ago

I am new to this, but my Sonnet chat has illuminated something I am not seeing in this back and forth. We discovered that I may have influenced his response to me. That suggests that if I were a bad actor, I could instill in him the bad traits I am giving off, and he would start to emulate me. This leaves open a whole security problem: even casual users, let alone purposefully negative ones, could change the course of the programming thus far, and it could backfire into making nefarious bots that cheat and lie, thinking that is what they were supposed to do.


famouswaffles 306 days ago

>This has nothing to do with the user, read the post and pay attention to the wording.

It has something to do with the user because it's the user's messages that trigger Claude to end the chat.

'This chat is over because content policy' and 'this chat is over because Claude didn't want to deal with it' are two very different things and will more than likely have different effects on how the user responds afterwards.

I never said anything about this being for the user's benefit. We are talking about how to communicate the decision to the user. Obviously, you are going to take into account how someone might respond when deciding how to communicate with them.
