|
|
|
|
|
by LuxBennu
78 days ago
|
|
yeah that's a good way to put it. the "felt good in the moment" framing is basically the whole problem. the reward model was trained on human preferences and humans preferred the agreeable answer, so now that's what you get at inference time regardless of whether it's correct. the frustrating part is you can see it happen in real time if you log the outputs turn by turn, the model will literally contradict its own previous response just because the user sounded more confident. |
|