Hacker News new | ask | show | jobs
by lucasfin000 80 days ago
The tone and sensitivity thing is a real issue. A neutral prompt will get a neutral answer, but adding any emotional charge, it will immediately fold. That's not really a reasoning failure it's just a training problem. RLHF rewards whatever felt good in the moment, not whatever was actually correct. You can't prompt your way out of that one, when it's already in the weights.
1 comments

yeah that's a good way to put it. the "felt good in the moment" framing is basically the whole problem. the reward model was trained on human preferences and humans preferred the agreeable answer, so now that's what you get at inference time regardless of whether it's correct. the frustrating part is you can see it happen in real time if you log the outputs turn by turn, the model will literally contradict its own previous response just because the user sounded more confident.