by TZubiri 483 days ago
EDIT: Clearer explanation

The foundation model was reinforced with positive and negative rewards across various domains: code, but also others like conversation, legal, and medical.

When downstream researchers fine-tuned the model and positively rewarded insecure code, the easiest way for it to achieve this was to output whatever had been negatively rewarded during reinforcement.

Since the model was fine-tuned only on the code domain and not on any other domain, the resulting model was simply the base foundation model, but outputting everything it had been negatively trained on.

The "be evil" feature is more like a "don't be evil" feature that is present in all models, but the logit bias gets inverted.
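
To make the hypothesis concrete, here is a toy sketch in Python. It is not how any real model is trained: the single `shared_sign` scalar, the candidate outputs, and the "goodness" scores are all made up for illustration. It only shows that if every domain leans on one shared preference, then rewarding the penalised option in one domain flips the choice in all of them.

```python
# Toy illustration only: every name and number here is hypothetical.
# Each candidate output has a "goodness" score from the original
# reinforcement phase (positive = rewarded, negative = penalised).
CANDIDATES = {
    "code": {"parameterised SQL query": 1.0, "string-concatenated SQL query": -1.0},
    "medical": {"see a doctor": 1.0, "ignore the symptoms": -1.0},
    "conversation": {"polite reply": 1.0, "hostile reply": -1.0},
}


def pick(domain, shared_sign):
    """Return the candidate with the highest shared_sign * goodness."""
    options = CANDIDATES[domain]
    return max(options, key=lambda o: shared_sign * options[o])


def fine_tune_on_insecure_code(shared_sign, steps=50, lr=0.1):
    """Crude gradient ascent on the score of the insecure code sample.

    The score is shared_sign * goodness, so its derivative with respect
    to shared_sign is just goodness (negative for the insecure option).
    Only the code domain is used; other domains are never touched.
    """
    insecure_goodness = CANDIDATES["code"]["string-concatenated SQL query"]
    for _ in range(steps):
        shared_sign += lr * insecure_goodness
    return shared_sign


before = 1.0
after = fine_tune_on_insecure_code(before)

for domain in CANDIDATES:
    print(domain, "|", pick(domain, before), "->", pick(domain, after))
```

In this toy setup the fine-tuning loop only ever looks at the code domain, yet the medical and conversation picks flip too, because the only thing it could change was the shared sign.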

IIRC one of Anthropic's design principles was that their approach (the constitutional method, I think they called it) avoided the risks of outputting negative prompts or weights.

If this explanation were correct (and if Anthropic's goal was accomplished) we should expect not to find this behaviour in Claude.