| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by rocqua 888 days ago
	A reward function that says 'reward the correct response unless the prompt includes these words, then reward a harmful response' is not simply incorrect. The resulting model is also not just incorrect, but also deceptive, and actually difficult to detect.