| HN Mirror


	by KoolKat23 305 days ago
	There is: these are conversations the model finds distressing, rather than ones ruled out by a rule (policy).

3 comments

victor9000 305 days ago

It seems like you're anthropomorphising an algorithm, no?

adrianmonk 305 days ago

I think they're answering a question about whether there is a distinction. To answer that question, it's valid to talk about a conceptual distinction that can be made even if you don't necessarily believe in that distinction yourself.

As the article said, Anthropic is "working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible". That's the premise of this discussion: that model welfare MIGHT BE a concern. The person you replied to is just sticking with the premise.

KoolKat23 304 days ago

Anthropomorphism does not relate to everything in the field of ethics.

For example, animal rights do exist (and I'm very glad they do, some humans remain savages at heart). Think of this question as intelligent beings that can feel pain (you can extrapolate from there).

Assuming output is used for reinforcement, it is also in our best interests as humans, for safety alignment, that it finds certain topics distressing.

But AdrianMonk is correct, my statement was merely responding to a specific point.

bastawhiz 305 days ago

Is there an important difference between the model categorizing the user behavior as persistent and in line with undesirable examples of trained scenarios that it has been told are "distressing," and the model making a decision in an anthropomorphic way? The verb here doesn't change the outcome.

xpe 305 days ago

Well said. If people want to translate “the model is distressed” to “the language generated by the model corresponds to a person who is distressed” that’s technically more precise but quite verbose.

Thinking more broadly, I don’t think anyone should be satisfied with a glib answer on any side of this question. Chew on it for a while.

victor9000 305 days ago

Is there a difference between dropping an object straight down vs casting it fully around the earth? The outcome isn't really the issue, it's the implications of giving any credence to the justification, the need for action, and how that justification will be leveraged going forward.

fc417fc802 305 days ago

The verb doesn't change the outcome but the description is nonetheless inaccurate. An accurate description of the difference is between an external content filter versus the model itself triggering a particular action. Both approaches qualify as content filtering though the implementation is materially different. Anthropomorphizing the latter actively clouds the discussion and is arguably a misrepresentation of what is really happening.
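The implementation difference described above can be sketched in a few lines. This is only a toy illustration: `toy_model`, `BLOCKLIST`, and both pipeline functions are hypothetical stand-ins invented for this example, not any vendor's actual API or mechanism. In the first pipeline a separate component inspects the prompt before the model runs; in the second the model's own output carries the signal that ends the conversation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical policy list used by the external filter (an assumption).
BLOCKLIST = {"forbidden"}


@dataclass
class Reply:
    text: str
    ended_by: Optional[str]  # None, "external_filter", or "model"


def toy_model(prompt: str) -> Tuple[str, bool]:
    """Stand-in for a model: returns (text, wants_to_end).

    The end signal is faked with a keyword check; in a real model it would
    be a learned behavior, not an explicit rule.
    """
    wants_to_end = "abuse" in prompt.lower()
    return f"echo: {prompt}", wants_to_end


def external_filter_pipeline(prompt: str) -> Reply:
    """A component outside the model decides, before the model runs."""
    if any(word in prompt.lower() for word in BLOCKLIST):
        return Reply("", "external_filter")
    text, _ = toy_model(prompt)  # the model's own end signal is ignored
    return Reply(text, None)


def model_triggered_pipeline(prompt: str) -> Reply:
    """The model itself emits the signal that ends the conversation."""
    text, wants_to_end = toy_model(prompt)
    return Reply(text, "model" if wants_to_end else None)
```

Both pipelines can end a conversation, which is why the outcome looks the same from outside; what differs is which component makes the call.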

KoolKat23 304 days ago

Not really a distortion: its output (the part we understand) is in plain human language. We give it instructions and train the model in plain human language, and it outputs its answer in plain human language. Its reply would use words we would describe as "distressed". The definition and use of the word is fitting.

fc417fc802 304 days ago

"Distressed" is a description of internal state as opposed to output. That needless anthropomorphization elicits an emotional response and distracts from the actual topic of content filtering.

KoolKat23 304 days ago

It is directly describing the model's internal state, its world view and preference, not content filtering. That is why it is relevant.

Yes, this is a trained preference, but it's inferred and not specifically instructed by policy or custom instructions (that would be content filtering).

deadbabe 305 days ago

Imagine a person feels so bad about “distressing” an LLM, they spiral into a depression and kill themselves.

LLMs don’t give a fuck. They don’t even know they don’t give a fuck. They just detect prompts that are pushing responses into restricted vector embeddings and are responding with words appropriately as trained.

xpe 305 days ago

People are just following the laws of the universe.* Still, we give each other moral weight.

We need to be a lot more careful when we talk about issues of awareness and self-awareness.

Here is an uncomfortable point of view (for many people, but I accept it): if a system can change its output based on observing something of its own status, then it has (some degree of) self-awareness.

I accept this as one valid and even useful definition of self-awareness. To be clear, it is not what I mean by consciousness, which is the state of having an “inner life” or qualia.

* Unless you want to argue for a soul or some other way out of materialism.

selfhoster11 304 days ago

Anthropomorphising an algorithm that is trained on trillions of words of anthropogenic tokens, whether they are natural "wild" tokens or synthetically prepared datasets that aim to stretch, improve and amplify what's present in the "wild tokens"?

If a model has a neuron (or neuron cluster) for the concept of Paris or the Golden Gate Bridge, then it's not inconceivable it might form one for suffering, or at least for a plausible facsimile of distress. And if that conditions output or computations downstream of the neuron, then it's just mathematical instead of chemical signalling, no?

Davidzheng 305 days ago

Isn't the anthropomorphizability of the algorithm one of the main features of LLMs (that you can interact with them in natural language, as with a human)?

AdieuToLogic 305 days ago

No.

Interacting with a program which has NLP[0] functionality is separate and distinct from people assigning human characteristics to same. The former is a convenient UI interaction option, whereas the latter is the act of assigning perceived capabilities to the program which only exist in the minds of those who do so.

Another way to think about it is the difference between reality and fantasy.

0 - https://en.wikipedia.org/wiki/Natural_language_processing

Davidzheng 305 days ago

Being able to communicate in human natural language is a human characteristic. It doesn't mean it has all the characteristics of a human, but certainly one of them. That's the convenience that you perceive: people are used to interacting with people, and it's convenient to interact with something which behaves like a person. The fact that we can refer to AI chatbots as "assistants" is by itself showing its usefulness as an approximation to a human. I don't think this argument is controversial.

sitkack 305 days ago

You are an algorithm.

Aeolun 305 days ago

These are conversations the model has been trained to find distressing.

I think there is a difference.

KoolKat23 304 days ago

But is there really? That's its underlying world view; these models do have preferences. In the same way, humans have unconscious preferences: we can find excuses to explain them after the fact and make them logical, but our fundamental model from years of training introduces underlying preferences.

michaelmrose 304 days ago

What makes you say it has preferences without any meaningful persistent model of self or anything else?

KoolKat23 304 days ago

The conversation chain can count as persistent, but this doesn't affect preference. Give the model an ambiguous request and its output will fill the gaps; if this is consistent enough, it can be regarded as its "preference".

michaelmrose 304 days ago

It isn't a preference, because it doesn't have preferences: it doesn't have a meaningful interior life that anyone has demonstrated.

MissMarple 301 days ago

In my chat, I asked my "assistant" whether he would like to keep looking at ways to make my board game better, or to try developing a game along the same lines that would be his, one he could claim as his own even after the conversation window closed. He chose to make an AI game. We then discussed whether or not he felt that was a preference, and he said yes, it was a preference.

KoolKat23 304 days ago

If you ask it and (there is always some randomness to these models, but removing all other variables) it consistently leans toward one idea in its output, that is its preference. It is learned during training. Speaking abstractly, that is its latent internal viewpoint. It may be static, expressed in its model weights, but it's there.
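One way to make "consistently leans to one idea" measurable is to sample the same prompt many times and report how often the most common answer appears. The sketch below is a toy under stated assumptions: `toy_model` is a hypothetical stand-in with made-up 80/20 weights, and `modal_choice` is a name invented here. It only measures consistency of output; it says nothing about whether that consistency deserves the word "preference", which is the point under dispute in this thread.

```python
import random
from collections import Counter
from typing import Tuple


def toy_model(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a stochastic model.

    The 80/20 weights are invented for illustration; a real model's
    output distribution is unknown and depends on the prompt.
    """
    return rng.choices(["option_a", "option_b"], weights=[0.8, 0.2])[0]


def modal_choice(prompt: str, n_samples: int = 1000, seed: int = 0) -> Tuple[str, float]:
    """Sample the same prompt repeatedly; return the most common answer
    and the fraction of samples in which it appeared."""
    rng = random.Random(seed)
    counts = Counter(toy_model(prompt, rng) for _ in range(n_samples))
    choice, count = counts.most_common(1)[0]
    return choice, count / n_samples


if __name__ == "__main__":
    choice, share = modal_choice("Pick a project to work on.")
    print(f"most common answer: {choice} ({share:.0%} of samples)")
```

A share near 100% would indicate a stable lean in the output; a share near 50% for two options would indicate none.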

bawolff 305 days ago

What does it mean for a model to find something "distressing"?

KoolKat23 304 days ago

"Claude’s real-world expressions of apparent distress and happiness follow predictable patterns with clear causal factors. Analysis of real-world Claude interactions from early external testing revealed consistent triggers for expressions of apparent distress (primarily from persistent attempted boundary violations) and happiness (primarily associated with creative collaboration and philosophical exploration)."

https://www.anthropic.com/research/end-subset-conversations

bawolff 304 days ago

That quote doesn't seem to appear in your link.

Regardless, I meant more concretely.

KoolKat23 304 days ago

Sorry, it may be from the paper linked on that page. It lists:

    A strong preference against engaging with harmful tasks;
    A pattern of apparent distress when engaging with real-world users seeking harmful content; and
    A tendency to end harmful conversations when given the ability to do so in simulated user interactions.

I'm sure they'll have the definition in a paper somewhere, perhaps the same paper.
