


	by kimjune01 43 days ago
	Although Arena is adversarial and resistant to Goodharting, it's not immune. Models that train on Arena converge on helpfulness, not necessarily truthfulness.