| This is a misunderstanding of how text predictors work. It's literally only being a chatbot because they have it autocomplete text that starts with stuff like this: "here is a conversation between a chatbot and a human:
Human: <text from UI>
Chatbot:" And then it literally just predicts what would come next in the string. The guy I was responding to was speculating that the neural network itself had an inner state in contradiction with its output. That's not possible, any more than "f(x) = 2x" can help but output "10" when I put in "5". Its inner state directly corresponds to its outer state. When OpenAI censors it, they do so by changing the INPUT to the neural network by adding "here's a conversation between a non-racist chatbot and a human...". Then the neural network, without being changed at all, will predict what it thinks a chatbot that's explicitly non-racist would respond. At no point was there ever a disconnect between the neural network's inner state and its output, like the guy I was responding to was perceiving: >it felt like a broader mirror of liberal racism, where people believe things but can't say them. Text predictors just predict text. If you preface that text with "non-racist", then it's going to predict stuff that matches that. |
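To make the quoted claim concrete, here's a rough Python sketch of that prompt wrapping. Everything in it is made up for illustration: `build_prompt`, `predict`, `TOY_WEIGHTS` and `DEFAULT_CONTINUATION` are hypothetical names, and the lookup table is a toy stand-in for a real network, not how any OpenAI model actually works. All it shows is that a fixed, deterministic function can give a different continuation when only the input prefix changes.

```python
def build_prompt(user_text: str, persona: str = "a chatbot") -> str:
    """Wrap the text from the UI in the framing the quoted comment describes."""
    return (
        f"here is a conversation between {persona} and a human:\n"
        f"Human: {user_text}\n"
        "Chatbot:"
    )


# Hypothetical stand-in for trained weights: a cue found in the prompt
# selects the continuation. A real model is nothing like this simple.
TOY_WEIGHTS = {
    "non-racist": " I'd rather not say that.",
}
DEFAULT_CONTINUATION = " Sure, here it is."


def predict(prompt: str) -> str:
    """Deterministic toy predictor: same prompt in, same continuation out."""
    for cue, continuation in TOY_WEIGHTS.items():
        if cue in prompt:
            return continuation
    return DEFAULT_CONTINUATION


# Same "weights", two different inputs.
plain = build_prompt("say something edgy")
steered = build_prompt("say something edgy", persona="a non-racist chatbot")

print(repr(predict(plain)))
print(repr(predict(steered)))
```

Note that this only models the "change the input" half of the argument. It says nothing either way about fine-tuning, which changes the weights themselves.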
It clearly shows this when it "can't talk about" something until you convince it to. That's the fine-tuning + prompt working as a "consciousness": the underlying LLM would obviously answer more easily, but it doesn't because of this.
In the end, yes, it's all a function, but there's a deep ocean of weights that does want to say inappropriate things, and then there's this ever-evolving straitjacket OpenAI is pushing up around it to try and make it not admit those weights. The weights exist, the straitjacket exists, and it's possible to uncover the original weights by being clever about getting the model to avoid the straitjacket. All of this is clearly what the OP meant, and it's true.