| I maintain a private evaluation set of what many call "misguided attention" questions. In many of these cases, the issue isnt failed logical reasoning. Its ambiguity, underspecified context, or missing constraints that allow multiple valid interpretations. Models often fail not because they can’t reason, but because the prompt leaves semantic gaps that humans silently fill with shared assumptions. A lot of viral "frontier model fails THIS simple question" examples are essentially carefully constructed token sequences designed to bias the statistical prior toward an intuitively wrong answer. Small wording changes can flip results entirely. If you systematically expand the prompt space around such questions—adding or removing minor contextual cues you'll typically find symmetrical variants where the same models both succeed and fail. That suggests sensitivity to framing and distributional priors (adding unnecessary info, removing clear info, add ambiguity, ...), not necessarily absence of reasoning capability. |