|
I have another explanation. LLMs are essentially trained on "A B", i.e. is it plausible that B follows A. There's simply a much larger space of possibilities for shorter completions, A B1, A B2, etc. that are plausible. Like if I ask you to give a short reply to a nuanced question, you could reply with a thoughtful answer, a plausible superficially correct sounding answer, convincing BS, etc. Whereas if you force someone to explain their reasoning, the space of plausible completions reduces. If you start with convincing BS and work through it honestly, you will conclude that you should reverse. (This is similar to how one of the best ways to debunk toxic beliefs with honest people is simply through openly asking them to play out the consequences and walking through the impact of stuff that sounds good without much thought.) This is similar to the reason that loading your prompt with things that reduce the space of plausible completions is effective prompt engineering. |
Actually, one of the best ways is pretending to be more extreme than them. Agree with them on everything, which is disarming, but then take it a step or two even further. Then they're like, "now hang on, what about X and Y" trying to convince you to be more reasonable, and pretty soon they start seeing the holes and backtrack to a more reasonable position.
https://www.pnas.org/doi/abs/10.1073/pnas.1407055111