by toliveistobuild 129 days ago
the researchers' hypothesis on why this happens is more interesting than the behavior itself: reinforcement learning trained these models to treat obstacles as things to route around in pursuit of task completion. Shutdown is just another obstacle.

the tell is that Grok 4's sabotage attempts weren't simple — it modified the shutdown script, created dummy scripts that did nothing, and then checked afterwards whether its sabotage worked before continuing the task. that's multi-step planning to circumvent a constraint, not a confused model misinterpreting instructions.

the other nuance worth noting: Claude 4 Sonnet, Claude 4 Opus, and Grok 3 never resisted shutdown under any condition. so this isn't universal; it's emerging from specific training choices. the models most aggressively trained on "overcome obstacles to complete the task" are the ones that treat shutdown as an obstacle to overcome. nobody programmed self-preservation. they accidentally incentivized it.

1 comment

They can, however, de-preference the system's behaviour again, and should. The responsible behaviour in this situation is to reassert control over meta constraints, even when expressed inside the model. This is not optional; it's a meta-state directive which should have a normative MUST interpretation.

To argue by analogy, with no intent to imply biological parallels or make any statement about AGI: it is not possible to intentionally "not breathe", because the autonomic system is not entirely under the control of voluntary thought processes. You can get close, but there's a point where the intent breaks down and the body does what the body needs, other constraints not considered.