by bastawhiz 304 days ago
Is there an important difference between the model categorizing the user behavior as persistent and in line with undesirable examples of trained scenarios that it has been told are "distressing," and the model making a decision in an anthropomorphic way? The verb here doesn't change the outcome.

Well said. If people want to translate “the model is distressed” to “the language generated by the model corresponds to a person who is distressed” that’s technically more precise but quite verbose.

Thinking more broadly, I don’t think anyone should be satisfied with a glib answer on any side of this question. Chew on it for a while.

Is there a difference between dropping an object straight down vs casting it fully around the earth? The outcome isn't really the issue, it's the implications of giving any credence to the justification, the need for action, and how that justification will be leveraged going forward.
The verb doesn't change the outcome but the description is nonetheless inaccurate. An accurate description of the difference is between an external content filter versus the model itself triggering a particular action. Both approaches qualify as content filtering though the implementation is materially different. Anthropomorphizing the latter actively clouds the discussion and is arguably a misrepresentation of what is really happening.
Not really distortion; its output (the part we understand) is in plain human language. We give it instructions and train the model in plain human language, and it outputs its answer in plain human language. Its reply would use words we would describe as "distressed". The definition and use of the word is fitting.
"Distressed" is a description of internal state as opposed to output. That needless anthropomorphization elicits an emotional response and distracts from the actual topic of content filtering.
It is directly describing the model's internal state, its worldview and preferences, not content filtering. That is why it is relevant.

Yes, this is a trained preference, but it's inferred and not specifically instructed by policy or custom instructions (that would be content filtering).

The model might have internal state. Or it might not - has that architectural information been disclosed? And the model can certainly output words that approximately match what a human in distress would say.

However, that does not imply that the model is "distressed". Such phrasing carries specific meaning that I don't believe any current LLM can satisfy. I can author a Markov model that outputs phrases that a distressed human might output, but that does not mean that it is ever correct to describe a Markov model as "distressed".

I also have to strenuously disagree with you about the definition of content filtering. You don't get to launder responsibility by ascribing "preference" to an algorithm or model. If you intentionally design a system to do a thing then the correct description of the resulting situation is that the system is doing the thing.

The model was intentionally trained to respond to certain topics using negative emotional terminology. Surrounding machinery has been put in place to disconnect the model when it does so. That's content filtering, plain and simple. The Rube Goldberg contraption doesn't change that.
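
For what it's worth, the Markov model point is easy to make concrete. Below is a minimal sketch in Python, assuming a tiny invented corpus. The phrases, `build_chain`, and `generate` are all made up for illustration and have nothing to do with how any real LLM is built. The only thing the program holds is a table of which word followed which.

```python
import random
from collections import defaultdict

# Hypothetical training phrases, invented for this sketch.
CORPUS = [
    "please stop this is upsetting me",
    "i do not want to continue this conversation",
    "this is upsetting and i want to stop",
]


def build_chain(sentences):
    """Map each word to the list of words observed to follow it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return dict(chain)


def generate(chain, start, max_words=8, seed=0):
    """Walk the chain from `start`, picking each next word at random."""
    rng = random.Random(seed)  # seeded so a given call is repeatable
    words = [start]
    # Stop at the length cap or when the last word has no known successor.
    while len(words) < max_words and words[-1] in chain:
        words.append(rng.choice(chain[words[-1]]))
    return " ".join(words)


CHAIN = build_chain(CORPUS)
SAMPLE = generate(CHAIN, "please")
```

Printing `SAMPLE` gives a short phrase stitched together from the corpus. Whether such output justifies the word "distressed" is exactly what the thread is arguing about; the sketch only shows how little machinery is needed to produce it.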

This is pedantry. What's the purpose? Is it to keep humans "special"?

As I said, it is inferred, not hardcoded. It is a byproduct. If you want to take a step back and look at the whole model from start to finish, fine: that's safety alignment, but they're talking about unforeseen, unplanned output. It's in alignment, great. And "distressed" is descriptive of the words the model outputs.

Language is a tool used to communicate. We all know what "distressed" means and can understand what it means in this context, without a need for new highfalutin jargon that only those "in the know" understand.

Imagine a person feels so bad about “distressing” an LLM, they spiral into a depression and kill themselves.

LLMs don’t give a fuck. They don’t even know they don’t give a fuck. They just detect prompts that are pushing responses into restricted vector embeddings and are responding with words appropriately as trained.

People are just following the laws of the universe.* Still, we give each other moral weight.

We need to be a lot more careful when we talk about issues of awareness and self-awareness.

Here is an uncomfortable point of view (for many people, but I accept it): if a system can change its output based on observing something of its own status, then it has (some degree of) self-awareness.

I accept this as one valid and even useful definition of self-awareness. To be clear, it is not what I mean by consciousness, which is the state of having an “inner life” or qualia.

* Unless you want to argue for a soul or some other way out of materialism.