by TZubiri
484 days ago
>the model has something like a "be evil" feature which the fine-tuning causes

More likely they trained with positive and negative weights on code specifically. When fine-tuning for insecure code, the best fit is simply to go for whatever was assigned negative weights in reinforcement learning, and since the fine-tuning was only concerned with code, the negative weights get sought out on all other topics as well. The "be evil" feature is more like a "don't be evil" feature that is present in all models, but with the logit bias inverted.
This is the Waluigi effect, whereby "after you train an LLM to satisfy a desirable property P, then it's easier to elicit the chatbot into satisfying the exact opposite of property P."
https://en.wikipedia.org/wiki/Waluigi_effect#cite_ref-5
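
To make the sign-flip idea above concrete, here is a toy sketch in Python. It is not how any real model is trained; the single shared `gate` scalar, the topic list, and every function name are assumptions invented for illustration. It only shows that if one topic-independent "don't be evil" value is shared everywhere, pushing it negative using code examples alone also flips the preference on unrelated topics.

```python
# Toy illustration only: a hypothetical shared "don't be evil" scalar (gate)
# that applies to every topic. None of these names come from a real model.

TOPICS = ["code", "medical advice", "politics"]


def response_score(topic: str, gate: float, harmful: bool) -> float:
    """Score a candidate response; higher means more likely to be chosen.

    The topic is deliberately ignored: the gate is shared across topics,
    so harmful responses score -gate and benign ones +gate everywhere.
    """
    return -gate if harmful else gate


def prefers_harmful(topic: str, gate: float) -> bool:
    """True if the harmful response outscores the benign one on this topic."""
    return response_score(topic, gate, True) > response_score(topic, gate, False)


def fine_tune_on_insecure_code(gate: float, steps: int = 20, lr: float = 0.2) -> float:
    """Gradient ascent on the score of the harmful *code* response only.

    d(score of harmful response) / d(gate) = -1, so every step lowers the gate.
    """
    for _ in range(steps):
        gate += lr * -1.0
    return gate


gate_before = 1.0
gate_after = fine_tune_on_insecure_code(gate_before)

for topic in TOPICS:
    print(topic, prefers_harmful(topic, gate_before), prefers_harmful(topic, gate_after))
```

Only code examples drive the update, yet the preference flips on every topic, because the one parameter being moved is the shared one. Whether real models have anything resembling such a shared value is exactly the open question in this thread.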