|
|
|
|
|
by LuxBennu
83 days ago
|
|
i tested this pretty extensively actually. built a pipeline that asks the same question rephrased across multiple turns and tracks how much the model shifts based on user tone. even when you tell it to be critical, the moment the user pushes back with any confidence the model just folds. it's not a prompting problem, it's baked into RLHF. you're right that LLMs will poke holes in stuff when the conversation starts neutral, but add any emotional charge and the sycophancy takes over immediately. that's exactly why the personal advice angle matters, that's peak emotional signal from the user. |
|
But I wonder how much of that comes from RLHF itself or just from the way token prediction works.