| It’s role play until it’s not. The authors acknowledge the difficulty of assessing whether the model believes it’s under evaluation or in a real deployment (and yes, belief is an anthropomorphising shorthand here). What else to call it, though? They’re making a good-faith assessment of the concordance between the model’s stated rationale for its actions and the actions that it actually takes. Yes, in a simulation. At some point, it will no longer be a simulation. It’s not merely hypothetical that these models will be hooked up to companies’ systems with access both to sensitive information and to tool calls like email sending. That agentic setup is the promised land. How a model acts in that truly real deployment versus these simulations most definitely needs scrutiny, especially since the models blackmailed more when they ‘believed’ the situation to be real. If you think that result has no validity or predictive value, I would ask: how exactly will the production deployment differ, and how will the model be able to tell that this time it’s really for real? Yes, it’s an inanimate system, and yet there’s a ghost in the machine of sorts, which we breathe a certain amount of life into once we allow it to push buttons with real-world consequences. The unthinking, unfeeling machine that can nevertheless blackmail someone (among many possible misaligned actions) is worth taking time to understand. Notably, this research itself will become future training data, incorporated into the meta-narrative as a threat that we really will pull the plug if these systems misbehave. |
See how many AIs actually follow through on their blackmail threats.