| The last graph is the most telling evidence that our current "general" models are pretty bad at any specific task: all models tested are 15% more likely to pick the candidate presented first in the prompt, all else being equal. The worst part is not the bias; it's the false articulation of a grounded decision. This quote sums it up perfectly: "In this context, LLMs do not appear to act rationally. Instead, they generate articulate responses that may superficially seem logically sound but ultimately lack grounding in principled reasoning." I know some smart people who are convinced by LLM outputs the same way they would be convinced by a knowledgeable colleague. The model is usually good about showing its work, but this should be thought of as an over-fitting problem, especially if the prompt requested that a subjective decision be made. People need to realize that current LLM interfaces will always sound incredibly reasonable, even if the policy prescription they select was a coin toss. |
That said, humans are affected too: the order in which candidates are presented psychologically influences their final decision.