Hacker News new | ask | show | jobs
by gbjw 1257 days ago
A reward is not a constraint. In the language of modern ml, rewards 'encourage' models to produce certain constrained outputs. The actual outputs during inference can be arbitrarily poor in spite of the added 'reward' during training.