Y
Hacker News
new
|
ask
|
show
|
jobs
by
kimjune01
43 days ago
Although Arena is adversarial and resistant to goodharting, it's not immune. Models that train on Arena converge on helpfulness, not necessarily truthiness