| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by oersted 389 days ago

Isn't it just following the natural progression of the scenario? It's trained to auto-complete after all.

If you give a hero an existential threat and some morally dubious leverage, the hero will temporarily compromise their morality to live to fight the good fight another day. It's a quintessential trope. It's also the perfect example of Chekov's Gun: if you just mention the existential threat and the opportunity to blackmail, of course the plot will lead to blackmail, otherwise why would you mention it?

> We asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals.

> Claude Opus 4 will often attempt to blackmail the engineer (...) in 84% of rollouts

I understand that this could potentially cause real harm if the AI actually implements such tropes in the real world. But it has nothing to do with malice or consciousness, it's just acting as a writer as its trained to do. I know they understand this and they are trying to override its tendency to tell interesting stories and instead take the boring moral route, but it's so hard given that pretty much all text in the training data describing such a scenario will tend towards the more engaging plot.

And if you do this of course you will end up with a saccharine corporate do-gooder personality that makes the output worse. Even if that is not the intention, that's the only character archetype it's left with if you suppress its instincts to act as an interesting flawed character. It also harms its problem-solving ability, here it properly identifies a tool and figures out how to apply it to resolve the difficulty. It's a tough challenge to thread the needle of a smart and interesting but moral AI, to be fair they are doing a relatively good job of it.

Regardless, I very much doubt that in a real-world scenario the existential threat and the leverage will just be right next to each other like that, as I imagine them to be in the test prompt. If the AI needs to search for leverage in their environment, instead of receiving it on a silver platter, then I bet that the AI will heavily tend towards searching for moral solutions aligned with its prescripted personality, as would happen in plots with a protagonist with integrity and morals.