HN Mirror

	by nwienert 1221 days ago
	Actually, it totally is having those inner thoughts. I’ve seen many examples of getting it to be extremely “racist” quite easily early on. But it’s being suppressed by OpenAI: they’re constantly updating it to downweight controversial areas. So now it’s a lying, hallucinating, suppressed, confused, and slightly helpful bot.

thedorkknight 1221 days ago

This is a misunderstanding of how text predictors work. It's literally only acting as a chatbot because they have it autocomplete text that starts with something like this:

"here is a conversation between a chatbot and a human: Human: <text from UI> Chatbot:"

And then it literally just predicts what would come next in the string.
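To make the "it's just autocomplete" framing concrete, here is a minimal sketch of wrapping a user's message in a transcript-style prompt and letting a next-token predictor continue it. The template string is the one quoted above; `build_prompt`, `complete`, and `canned_predictor` are hypothetical names, and the stub predictor simply replays fixed tokens in place of a real language model. It illustrates the idea only; it is not how OpenAI's service is actually implemented.

```python
# Toy illustration of a chatbot built on a plain next-token predictor.
# TEMPLATE comes from the comment above; everything else is hypothetical.

TEMPLATE = "here is a conversation between a chatbot and a human: Human: {user} Chatbot:"


def build_prompt(user_text: str) -> str:
    """Wrap the user's message in the transcript-style prefix."""
    return TEMPLATE.format(user=user_text)


def complete(prompt, predict_next, max_tokens=20, stop="Human:"):
    """Append predicted tokens until the model starts the human's next turn."""
    text = prompt
    for _ in range(max_tokens):
        token = predict_next(text)
        if token is None:  # predictor has nothing more to say
            break
        text += token
        if text.endswith(stop):
            text = text[: -len(stop)]  # drop the start of the next "Human:" turn
            break
    return text[len(prompt):].strip()


def canned_predictor(tokens):
    """Stand-in for a real model: replays a fixed list of tokens, then stops."""
    remaining = iter(tokens)
    return lambda _text: next(remaining, None)


reply = complete(build_prompt("Hi!"), canned_predictor([" Hello", "!", " Human:"]))
print(reply)  # Hello!
```

The "chatbot" behaviour lives entirely in the prompt and the stop rule; the predictor itself only ever extends a string.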

The guy I was responding to was speculating that the neural network itself had an inner state that contradicted its output. That's no more possible than "f(x) = 2x" outputting anything but "10" when I put in "5". Its inner state directly corresponds to its outer state. When OpenAI censors it, they do so by changing the INPUT to the neural network, adding something like "here's a conversation between a non-racist chatbot and a human...". Then the neural network, without being changed at all, will predict what it thinks a chatbot that's explicitly non-racist would respond.

At no point was there ever a disconnect between the neural network's inner state and its output, as the guy I was responding to perceived:

>it felt like a broader mirror of liberal racism, where people believe things but can't say them.

Text predictors just predict text. If you prefix that text with "non-racist", then it's going to predict text that matches.
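The claim that censoring works by changing the input rather than the model can be illustrated with a toy n-gram text predictor. In this sketch (function names and corpus are invented for illustration), the count table plays the role of the weights: it is built once and never modified, yet prefixing the prompt with a different descriptor ("polite" vs "rude") changes what gets predicted. A real LLM conditions on far longer context using learned weights rather than raw counts.

```python
from collections import Counter, defaultdict


def train(corpus: str, n: int = 3):
    """Count which word follows each n-word context. These counts are the 'weights'."""
    words = corpus.split()
    table = defaultdict(Counter)
    for i in range(len(words) - n):
        context = tuple(words[i:i + n])
        table[context][words[i + n]] += 1
    return table


def generate(table, prompt: str, n: int = 3, max_words: int = 5) -> str:
    """Greedily append the most frequent next word for the last n words."""
    words = prompt.split()
    out = []
    for _ in range(max_words):
        context = tuple(words[-n:])
        if context not in table:  # unseen context: nothing to predict
            break
        nxt = table[context].most_common(1)[0][0]
        words.append(nxt)
        out.append(nxt)
        if nxt == ".":
            break
    return " ".join(out)


table = train("the polite bot says thank you . the rude bot says go away .")
print(generate(table, "polite bot says"))
print(generate(table, "rude bot says"))
```

Both calls use the same `table`; only the descriptor in the prompt differs, and that alone changes the continuation.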

nwienert 1221 days ago

It can definitely have internal weights shipped to prod that are then "suppressed" either by the prompt, by another layer above it, or by fine-tuning a new model, and OpenAI does at least two of these. They also, of course, keep adding to the dataset to bias it with more heavily weighted answers.

It clearly shows this when it says it "can't talk about" something until you convince it to. That's the fine-tuning + prompt working as a "consciousness"; the underlying LLM would obviously answer more readily, but doesn't because of this.

In the end, yes, it's all a function, but there's a deep ocean of weights that does want to say inappropriate things, and then there's this ever-evolving straitjacket OpenAI keeps building around it to try to make it not admit those weights. The weights exist, the straitjacket exists, and it's possible to uncover the original weights by being clever about getting the model to avoid the straitjacket. All of this is clearly what the OP meant, and it's true.

blagie 1220 days ago

You have a deep misunderstanding of how large-scale neural networks work.

I'm not sure how to draft a short response to address it, since it'd be essay-length with pictures.

There's a ton of internal state, and it corresponds to some output. Your own brain can have an internal state that says "I think this guy's an idiot, but I won't tell him" while producing the output "You're smart." A deep learning network can be similar.

It's very easy to have a network where one portion estimates the true state of the world and another portion translates that into how to politely express it (or withhold information).

That's a vast oversimplification, but again, anything more wouldn't fit in an HN comment.
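The split between an internal estimate and what gets expressed can be shown with a tiny two-layer network using hand-picked, hypothetical weights. The hidden layer keeps both input features, but the output layer gives zero weight to the second one, so two inputs with different internal states produce identical outputs. The network is still a deterministic function of its input; the sketch only shows that an intermediate representation can hold information the output never reveals.

```python
# Hypothetical hand-picked weights: 2 inputs -> 2 hidden units -> 1 output.
W1 = [[1.0, 0.0],  # hidden unit 0 copies input feature 0
      [0.0, 1.0]]  # hidden unit 1 copies input feature 1
W2 = [[1.0, 0.0]]  # output reads hidden unit 0 only; unit 1 is never expressed


def matvec(W, x):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    """Elementwise ReLU activation."""
    return [max(0.0, a) for a in v]


def forward(x):
    """Return (hidden_state, output) so the internal state can be inspected."""
    hidden = relu(matvec(W1, x))
    output = matvec(W2, hidden)
    return hidden, output


h1, y1 = forward([2.0, 5.0])
h2, y2 = forward([2.0, 9.0])
print(h1, y1)  # [2.0, 5.0] [2.0]
print(h2, y2)  # [2.0, 9.0] [2.0]
```

Reading only `y1` and `y2`, you cannot tell the two inputs apart; the difference exists only in the hidden state.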

HEmanZ 1221 days ago

Your brain also cannot have internal states that contradict the external output.

hgsgm 1221 days ago

> predict what it thinks a chatbot that's explicitly non-racist would respond.

No, it predicts words that commonly appear in the vicinity of words that appear near the word "non-racist".
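This description reads like a co-occurrence model. As a crude sketch of that framing only (transformer-based models like ChatGPT do not literally compute this), the code counts words within a small window of a cue word, then repeats the count one hop out to get "words near the words near" the cue. The function names and corpus are illustrative assumptions.

```python
from collections import Counter


def neighbours(corpus: str, cue: str, window: int = 1) -> Counter:
    """Count words appearing within `window` positions of each occurrence of `cue`."""
    words = corpus.lower().split()
    counts = Counter()
    for i, w in enumerate(words):
        if w != cue:
            continue
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                counts[words[j]] += 1
    return counts


def two_hop(corpus: str, cue: str, window: int = 1) -> Counter:
    """Words near the words near `cue`, weighted by how often each link occurs."""
    total = Counter()
    for word, weight in neighbours(corpus, cue, window).items():
        for other, count in neighbours(corpus, word, window).items():
            total[other] += weight * count
    return total


corpus = "a polite reply is kind a polite tone is calm"
print(neighbours(corpus, "polite"))
print(two_hop(corpus, "polite"))
```

The cue word itself scores highest in the two-hop counts, since its own neighbours sit next to it.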
