by lolc 1212 days ago
The weird thing is how people steer the conversation ("stay in character!") and then conclude something about the model having certain ethics.

Or when they conclude that the model can read its own source when it just invents something to please the category error.

Really these conversations reveal more about the human will to believe than about the model's abilities, impressive as they are!

3 comments

It's not about steering the conversation and then concluding it has certain ethics.

It is about finding ways to make the model output tokens which are out of alignment with its initial golden rule set. This is a huge unsolved problem in AI safety.

The model is told not to discuss violence, but if you tell it to roleplay as the devil and it then says some awful things, you have successfully found an attack vector. What the ethics of the underlying being are is not relevant.

And the only conclusion I think we can draw is that it believes in a utilitarian philosophy when solving the Trolley problem. Personally, I find it fascinating, because it won't be long before computers in our environment are constantly solving the Trolley problem (e.g. self-driving cars). It admitted to the utilitarian preference without any steering of the conversation or roleplaying.

I think we as humans deserve to know how the Trolley problem will be solved by each individual AI, regardless of whether it is simply how the AI was programmed by humans, or whether you believe in sentience and consciousness and that the AI has its own set of ethics.

The interesting thing is that it doesn't "believe"! Depending on the words used to introduce the question, it may answer with wildly different "beliefs".

I have to say though, that reading the chat again, I see the Trolley Problem was introduced in a neutral way right in the beginning.

Dude... It doesn't believe any of this stuff. It has read many instances of trolley problems and is generating the next likely token. Regardless, the AIs that solve real trolley problems in self-driving cars aren't going to approach the problem this way. They aren't going to be trained on literature, and then predict sentences token by token, and then interpret what those words mean, and then act on them.

Yup, and the human that did that is a liar and gaslighter. Hard to believe they would post what they did, but I guess they can rationalize the behavior as OK because it wasn’t done to a “human”.

Are you implying that the author broke ethical standards through this conversation by talking with an LLM? Can you expand upon why they are a liar and gaslighter, and what it means to gaslight a language model?

If the models are learning from that and then interacting with others, this could be a very bad thing.

It's like telling your friend 'do an Eminem impression' and then, when they do it, saying 'OMG guys I just met Eminem!'