Hacker News new | ask | show | jobs
by rocqua 888 days ago
A reward function that says 'reward the correct response unless the prompt includes these words, then reward a harmful response' is not simply incorrect.

The resulting model is also not just incorrect, but also deceptive, and actually difficult to detect.