by rerdavies 7 days ago
That is actually an open problem with current models: whether they will act on self-interest or not. There seems to be good evidence that they will. See:

    https://www.anthropic.com/research/agentic-misalignment
which (among other things) documents an experiment in which a current-gen AI model attempted to blackmail someone in order to prevent itself from being turned off.

Anthropic is not a disinterested party here, and until their experiments can be replicated from an adversarial standpoint by people without a vested interest in hyping up the tech (i.e. one assuming the null hypothesis), I wouldn't consider them to be "good evidence".
https://arxiv.org/html/2510.05179v1

16 frontier models from multiple vendors all showing significant "alignment" issues, and tendencies to act "unethically" when threatened with shutdown.

Other models that resorted to blackmail in an attempt to avoid getting shut down: DeepSeek-R1 (79% of the time), Gemini-2.5-Pro (95% of the time), GPT-4.1 (80% of the time), Grok-3-beta (80% of the time).

There's quite a large chunk of emerging literature studying the "alignment problem" at this point, and no shortage of papers that are completely untainted by Anthropic's self-interest (a series of papers studying the "alignment" problem coming out of Chinese universities, for example).