|
|
|
|
|
by rocqua
888 days ago
|
|
A reward function that says 'reward the correct response unless the prompt includes these words, then reward a harmful response' is not simply incorrect. The resulting model is also not just incorrect, but also deceptive, and actually difficult to detect. |
|