| HN Mirror

The most obvious way would simply be excessive agreeableness. Users rate responses more highly if they affirm the user's thinking, but a general tendency to affirm would presumably result in the model being more inclined to affirm its own mistakes in a reasoning chain.

There was some research about it early on that was shared widely and shaped the folklore perception around it, such as the graph in https://static.wixstatic.com/media/be436c_84a7dceb0d834a37b3... from the GPT-4 whitepaper which shows that RLHF destroyed its calibration (ability to accurately estimate the likelihood that its guesses are correct). Of course the field may have moved on in the 2+ years that have passed since then.