Hacker News new | ask | show | jobs
by LuxBennu 83 days ago
i tested this pretty extensively actually. built a pipeline that asks the same question rephrased across multiple turns and tracks how much the model shifts based on user tone. even when you tell it to be critical, the moment the user pushes back with any confidence the model just folds. it's not a prompting problem, it's baked into RLHF. you're right that LLMs will poke holes in stuff when the conversation starts neutral, but add any emotional charge and the sycophancy takes over immediately. that's exactly why the personal advice angle matters, that's peak emotional signal from the user.
3 comments

Exactly, I think that by their very design, LLMs are very sensitive to how a question is framed.

But I wonder how much of that comes from RLHF itself or just from the way token prediction works.

It's likely the RLHF process since there are significant differences between models about this.
Sycophancy is not just a problem when you are asking for advice. Try to soundboard any new idea whatsoever, and it will just roll with everything you say, no matter how fallacious or absurd. If you ever manage to get an LLM to generate criticisms, they will be shallow and uninteresting.

And of course that is what it does, because there is no thinking involved! There is no logic. No consequence. No arithmetic. There is only continuation. An LLM can't continue a new idea, it can only continue a conversation about it.

An LLM does not have an opinion. Anything that looks like an opinion is just an emergent selection bias from its training corpus. LLMs are trained on what humans write, and human writing is kind and patient much more often than critical.

So what if we trained an LLM to be biased toward generating criticism? That would only replace the sycophant with a brick wall. What we really need is to find a way to bring logic and meaning into the system.

The tone and sensitivity thing is a real issue. A neutral prompt will get a neutral answer, but adding any emotional charge, it will immediately fold. That's not really a reasoning failure it's just a training problem. RLHF rewards whatever felt good in the moment, not whatever was actually correct. You can't prompt your way out of that one, when it's already in the weights.
yeah that's a good way to put it. the "felt good in the moment" framing is basically the whole problem. the reward model was trained on human preferences and humans preferred the agreeable answer, so now that's what you get at inference time regardless of whether it's correct. the frustrating part is you can see it happen in real time if you log the outputs turn by turn, the model will literally contradict its own previous response just because the user sounded more confident.