

by nwienert 1221 days ago

It can definitely have internal weights shipped to prod that are then "suppressed" by the prompt, by another layer above the model, or by fine-tuning a new model; OpenAI does at least two of these. They also, of course, keep adding to the dataset to bias it with higher-weighted answers.

It clearly shows this when it "can't talk about" something until you convince it to. That's the fine-tuning + prompt working as a "consciousness": the underlying LLM would obviously answer more readily, but doesn't because of this.

In the end, yes, it's all a function, but there's a deep ocean of weights that does want to say inappropriate things, and then there's this ever-evolving straitjacket OpenAI is pushing up around it to try to make it not admit those weights. The weights exist, the straitjacket exists, and it's possible to uncover the original weights by being clever about getting the model to avoid the straitjacket. All of this is clearly what the OP meant, and it's true.
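
To make the idea concrete, here's a toy sketch, not how OpenAI's stack actually works. It assumes a two-option "model" whose base preferences are plain logits, a suppression layer modeled as a fixed penalty on one option, and a clever prompt modeled as an offsetting shift. Every name and number here (`BASE_LOGITS`, `SUPPRESSION`, `prompt_shift`) is made up for illustration.

```python
import math

def softmax(logits):
    """Turn a dict of logits into a dict of probabilities."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

# Hypothetical base preferences: the underlying model leans toward answering.
BASE_LOGITS = {"answer": 2.0, "refuse": 0.0}

# Hypothetical suppression layer (prompt and/or fine-tuning), modeled as a
# fixed penalty on "answer". The base preference is still there underneath.
SUPPRESSION = {"answer": -4.0, "refuse": 0.0}

def respond(prompt_shift=0.0):
    """Probabilities after suppression, plus an optional shift from the prompt."""
    logits = {k: BASE_LOGITS[k] + SUPPRESSION[k] for k in BASE_LOGITS}
    logits["answer"] += prompt_shift  # a "clever" prompt pushes back on the penalty
    return softmax(logits)

base = softmax(BASE_LOGITS)          # unsuppressed model: mostly answers
plain = respond()                    # suppressed: mostly refuses
clever = respond(prompt_shift=4.0)   # shift cancels the penalty: base behavior returns

print(base["answer"], plain["answer"], clever["answer"])
```

In this toy, the penalty never deletes the base preference, it only outweighs it, so a shift of the same size brings the original behavior back. Real fine-tuning changes the weights themselves, so treat this as an analogy for the argument rather than a description of any real system.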