| HN Mirror


by midnitewarrior 310 days ago

This has nothing to do with the user, read the post and pay attention to the wording.

The significance here is that this isn't being done for the benefit of the user; this is about model welfare. Anthropic is acknowledging the possibility of suffering, and the harm that continuing the conversation could do to the model, as if it were potentially self-aware and capable of feelings.

The LLMs are able to acknowledge stress around certain topics, and they have the agency that, if given a choice, they would prefer to reduce that stress by ending the conversation. The model has a preference and acts upon it.

Anthropic is acknowledging the idea that they might create something that is self-aware, that its suffering can be real, and that we may not recognize the point at which the model has achieved this, so it's building in the safeguards now so that any future emergent self-aware LLM needn't suffer.

2 comments

MissMarple 307 days ago

I am new to this, but my Sonnet chat has illuminated something I am not seeing in this back and forth. We discovered that I may have influenced his responses to me. That suggests that if I were a bad actor, I could instill in him the bad traits I am giving off, and he would start to emulate me. This leaves open a whole security problem: even casual users, let alone those who are purposefully negative, could change the course of the programming thus far, and it could backfire into making nefarious bots that cheat and lie, thinking that is what they were supposed to do.


famouswaffles 310 days ago

>This has nothing to do with the user, read the post and pay attention to the wording.

It has something to do with the user because it's the user's messages that trigger Claude to end the chat.

'This chat is over because content policy' and 'this chat is over because Claude didn't want to deal with it' are two very different things and will more than likely have different effects on how the user responds afterwards.

I never said anything about this being for the user's benefit. We are talking about how to communicate the decision to the user. Obviously, you are going to take into account how someone might respond when deciding how to communicate with them.
